05: UTF-16, bit twiddling, parsing Exif

Announcements

`hexedit` solution

Let's spend a few minutes doing the hexedit assignment to give you a sense of how I expect you to approach these assignments. (And maybe to give you a sense of what Python 3 code can look like...)

(Note: solutions will be posted on the course site after the due date.)

UTF-16

Just a few words about UTF-16.

Like UTF-8, it's an encoding of Unicode code points (numbers) into bytes. Like UTF-8, it's variable-width. Unlike UTF-8, each "unit" is 16 bits (two bytes) wide, so you have endianness to consider. But like UTF-8, it encodes some of Unicode directly. UTF-8 encodes the first 128 code points (U+0000 -- U+007F) directly into one byte; UTF-16 encodes U+0000 -- U+D7FF and U+E000 -- U+FFFF directly into two bytes, and uses a four-byte encoding (two 16-bit units, called a surrogate pair) for code points outside this range.

Notably, this means the ASCII subset of Unicode is encoded as sequences of alternating zero bytes and ASCII characters. For example, an ASCII encoding of "Marc" would be 4D 61 72 63; in big-endian UTF-16, it would read 00 4D 00 61 00 72 00 63 (and in little-endian UTF-16, 4D 00 61 00 72 00 63 00).
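
If you want to check those byte sequences yourself, here is a quick sketch using Python's built-in codecs. The last line is my own addition: it encodes one code point above U+FFFF (U+1F600) to show the four-byte form.

```python
text = "Marc"

ascii_bytes = text.encode("ascii")
utf16_be = text.encode("utf-16-be")  # big endian, no byte order mark
utf16_le = text.encode("utf-16-le")  # little endian, no byte order mark

print(ascii_bytes.hex(" "))  # 4d 61 72 63
print(utf16_be.hex(" "))     # 00 4d 00 61 00 72 00 63
print(utf16_le.hex(" "))     # 4d 00 61 00 72 00 63 00

# A code point outside the directly encoded ranges becomes two 16-bit units.
surrogate_pair = "\U0001F600".encode("utf-16-be")
print(surrogate_pair.hex(" "))  # d8 3d de 00
```

Note that `bytes.hex()` only accepts a separator argument in Python 3.8 and later.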

Writing a "discount" UTF-16 ASCII string extractor is pretty straightforward as a result.
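
Here is one way such an extractor might look. This is a minimal sketch, not the assignment solution: the function name `utf16_ascii_strings`, the `little_endian` flag, and the minimum run length of 4 are all my own choices for illustration. It only finds printable ASCII (0x20 to 0x7E) stored as UTF-16 and ignores everything else.

```python
import re


def utf16_ascii_strings(data, little_endian=True, min_len=4):
    """Return (offset, text) pairs for runs of printable ASCII stored as UTF-16.

    Hypothetical helper: only looks for one byte order per call.
    """
    # One UTF-16 unit holding a printable ASCII character.
    if little_endian:
        unit = rb"[\x20-\x7e]\x00"
        encoding = "utf-16-le"
    else:
        unit = rb"\x00[\x20-\x7e]"
        encoding = "utf-16-be"
    pattern = re.compile(rb"(?:" + unit + rb"){%d,}" % min_len)
    return [(m.start(), m.group().decode(encoding)) for m in pattern.finditer(data)]


le_sample = b"\x01\x02" + "Marc".encode("utf-16-le") + b"\xff\xff"
be_sample = "hello".encode("utf-16-be") + b"\xff"

print(utf16_ascii_strings(le_sample))                       # [(2, 'Marc')]
print(utf16_ascii_strings(be_sample, little_endian=False))  # [(0, 'hello')]
```

One caveat: the same bytes can look like a plausible string in both byte orders at different alignments, so a real tool would want to report which byte order produced each hit.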

Bit twiddling in Python

For those of you who aren't fresh out of 230, here's a quick recap on bit twiddling.

The fundamental operations you'll need are "bitwise and" (in Python, the operator &), "bitwise or" (|), and bit shifting left (<<) and right (>>). (See, for example, https://docs.python.org/3/reference/expressions.html#binary-bitwise-operations).

For example, suppose I wanted to extract the middle four bits (out of one byte) of the number 222, and put them at the front of a byte that ends with four set bits.

destination = 0b1111
source = 222
print(bin(source)) # just to see it, don't actually manipulate strings!
# => 0b11011110

mask = 0b00111100 
# you can write masks (or any values, really!) in hex if you want: 
# 0b00111100 == 0x3c == 60

# extract the middle four bits; all others are set to zero
bits = source & mask
print(bin(bits))
# => 0b11100 # notice the leftmost zeros are elided

# shift them into position; we want them in the front four, so we need to
# move them over two
bits = bits << 2
print(bin(bits))
# => 0b1110000

# now put them into the destination:
destination = destination | bits
print(bin(destination))
# => 0b1111111
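
The same mask-and-shift steps can be wrapped in a small helper if you find yourself doing this a lot. This is just a sketch; `extract_bits` is a name I made up, and it shifts first and masks second, which gives the same bits as masking and then shifting.

```python
def extract_bits(value, low_bit, width):
    """Return `width` bits of `value`, starting at bit `low_bit` (bit 0 is the least significant)."""
    mask = (1 << width) - 1          # e.g. width=4 gives 0b1111
    return (value >> low_bit) & mask


middle = extract_bits(222, 2, 4)     # the middle four bits of 0b11011110
result = (middle << 4) | 0b1111      # put them in the front four bits
print(bin(middle))  # 0b111
print(bin(result))  # 0b1111111
```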

Talk to me or the TA if you need more than that to get started.

More on JPEG/JFIF

First, let's look at the JFIF format description on Wikipedia. Notice that it's structured as a series of segments, each of which has a particular two-byte marker in the format FF XX, where the XX tells you what type of marker it is. Start and end of image are obvious; many of the others are "header" information of various types.

For example, the APP0 header (FFE0) tells you some things about the image (for example, resolution). Also notice that this header can contain an embedded thumbnail (an image within an image). That's why we can't just go to the first FFD9 after an FFD8 when carving: we need to try them all.
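
To make "try them all" concrete, here is a naive sketch that lists every (start, end) span running from an FFD8 to a later FFD9. The function names are my own, and this only produces candidates; deciding which candidates are real images is a separate step that this sketch does not attempt.

```python
def find_all(data, needle):
    """Return every offset at which `needle` occurs in `data`."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets


def candidate_jpeg_spans(data):
    """Return (start, end) slices for every FFD8 paired with every later FFD9."""
    starts = find_all(data, b"\xff\xd8")
    ends = find_all(data, b"\xff\xd9")
    # end is exclusive, so add 2 to include the FFD9 marker itself
    return [(s, e + 2) for s in starts for e in ends if e > s]


# Made-up bytes: an outer "image" with a thumbnail-like inner image.
blob = b"junk\xff\xd8AA\xff\xd8thumb\xff\xd9BB\xff\xd9"
spans = candidate_jpeg_spans(blob)
print(spans)  # [(4, 17), (4, 21), (8, 17), (8, 21)]
```

Each candidate is then `blob[start:end]`. The number of candidates grows with the product of the two marker counts, so this is only practical on small inputs.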

Now, how do we recognize and parse an Exif segment? It starts with FF E1. But the Exif format is a legacy format with tons of cruft. The standard is awful (see: http://www.cipa.jp/std/documents/e/DC-008-Translation-2016-E.pdf) and easy-to-read descriptions of how to parse it are hard to come by. We're going to do an example on paper in class today, and go over it again next week in more detail, as it (a) is obviously important forensically and (b) makes a great example of how to parse a binary rather than textual file format.

We're going to look at this image:

some lovely art

Pull it up in a binary viewer (like, say, hexdump). You could use, for example,

hexdump -Cv FullSizeRender.jpg | head -n 11 > output.txt

to hexdump the file, keep just the first 11 lines, and send them to a file named output.txt. (I'll hand out hard copies in class.)

It will look something like:

00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 00 00 48  |......JFIF.....H|
00000010  00 48 00 00 ff e1 04 dc  45 78 69 66 00 00 4d 4d  |.H......Exif..MM|
00000020  00 2a 00 00 00 08 00 09  01 0f 00 02 00 00 00 06  |.*..............|
00000030  00 00 00 7a 01 10 00 02  00 00 00 09 00 00 00 80  |...z............|
00000040  01 1a 00 05 00 00 00 01  00 00 00 8a 01 1b 00 05  |................|
00000050  00 00 00 01 00 00 00 92  01 28 00 03 00 00 00 01  |.........(......|
00000060  00 02 00 00 01 31 00 02  00 00 00 06 00 00 00 9a  |.....1..........|
00000070  01 32 00 02 00 00 00 14  00 00 00 a0 87 69 00 04  |.2...........i..|
00000080  00 00 00 01 00 00 00 b4  88 25 00 04 00 00 00 01  |.........%......|
00000090  00 00 03 d2 00 00 00 00  41 70 70 6c 65 00 69 50  |........Apple.iP|
000000a0  68 6f 6e 65 20 35 00 00  00 00 00 48 00 00 00 01  |hone 5.....H....|
...

Notes:

ffd8 => JPEG Start of Image (SOI) marker

ffe0 => marker number, here APP0 (length of this field is 2)

0010 => size, including these two bytes.
           Since 0x0010=16, read in 16-2=14 more bytes to grab the entire APP0 segment.
           Location of next marker is:
            = location_of_marker + length_of_marker_number + size
            = 0x2 + 0x2 + 0x10 = 0x14

ffe1 => marker number, here APP1

04dc => size, including these two bytes.
           0x04dc=1244, and so the next 1242 bytes must be read in.
           Location of next marker is 0x14+0x02+0x04dc=0x04f2
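
The marker arithmetic above can be sketched as a loop. This is a minimal illustration, not a full JPEG parser: `list_segments` is a name I made up, it assumes the data begins with FFD8, and it stops as soon as it runs out of bytes or hits something that isn't a marker. Segment sizes in JPEG are always big endian, hence the `>` in the format string.

```python
import struct

# The first two lines of the hexdump above.
HEADER = bytes.fromhex(
    "ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 48 "
    "00 48 00 00 ff e1 04 dc 45 78 69 66 00 00 4d 4d"
)


def list_segments(data):
    """Return (offset, marker, size) for each segment header found after FFD8."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing FFD8")
    segments = []
    offset = 2
    while offset + 4 <= len(data):
        marker, size = struct.unpack(">HH", data[offset:offset + 4])
        if marker >> 8 != 0xFF:
            break  # not a marker; give up rather than guess
        segments.append((offset, marker, size))
        if marker == 0xFFDA:
            break  # start of scan: compressed image data follows
        # next marker = location_of_marker + length_of_marker_number + size
        offset += 2 + size
    return segments


segments = list_segments(HEADER)
for offset, marker, size in segments:
    print(hex(offset), hex(marker), hex(size), "next marker at", hex(offset + 2 + size))
# 0x2 0xffe0 0x10 next marker at 0x14
# 0x14 0xffe1 0x4dc next marker at 0x4f2
```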

4578696600004d4d002a == b'Exif\x00\x00MM\x00\x2a'
     The first 6 bytes (b'Exif\x00\x00') state that this is in fact the Exif we are looking for.
     If you don't see these bytes, then this APP1 segment is not Exif, so move on.
     0x4d4d => "MM", which tells you that the Exif entries are big endian ("M" stands for "Motorola").
     0x002a => the constant 42, read as big endian.
     (If the file were little endian, these four bytes would be 49 49 2a 00: "II" for "Intel", then 42 with its bytes swapped.)

00000008 => ifd_offset, the offset in bytes to the Image File Directory (IFD), measured from the first 0x4d of 0x4d4d. How many bytes do we need to skip ahead to find the start of the IFD? First ask how many bytes we have already read, counting from that 0x4d: 2 bytes for 0x4d4d, 2 more for 0x002a, and 4 more for the offset value itself. That's 2+2+4 = 8, so we don't have to skip ahead at all: the IFD starts at the very next byte.
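
Here is a sketch of those header checks using the `struct` module. The function name `parse_exif_header` and its return shape are my own choices; it takes the bytes of the APP1 segment that follow the two-byte size field.

```python
import struct


def parse_exif_header(payload):
    """Return (endian, ifd_offset) for an Exif payload, or None if it isn't Exif.

    `endian` is a struct prefix: ">" for big endian ("MM"), "<" for little ("II").
    """
    if payload[:6] != b"Exif\x00\x00":
        return None  # some other kind of APP1 segment; move on
    tiff = payload[6:]  # offsets are counted from here (the first M or I)
    byte_order = tiff[:2]
    if byte_order == b"MM":
        endian = ">"
    elif byte_order == b"II":
        endian = "<"
    else:
        raise ValueError("unknown byte order marker")
    magic, ifd_offset = struct.unpack(endian + "HI", tiff[2:8])
    if magic != 42:
        raise ValueError("missing the constant 42")
    return endian, ifd_offset


# Bytes 0x18 through 0x25 of the hexdump above.
payload = bytes.fromhex("45 78 69 66 00 00 4d 4d 00 2a 00 00 00 08")
print(parse_exif_header(payload))  # ('>', 8)
```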

0009 => entries; there are 0x9 entries in this IFD, and the first starts immediately after this count.
Each entry is 12 bytes long.
Therefore, entry i (counting from 0) starts at ifd_offset + 2 + i*12, where the 2 skips the entry count.
Each entry is tag, format, number_of_components, data, laid out like so: 0xttttffffnnnnnnnndddddddd

010f=> tag, in this case "Make"

0002=> format, in this case 2

00000006=> components, in this case 6

0000007a=> data, which is 0x7a or 122. Let's come back to that in a second.

That was 12 bytes. And so now the second entry starts:

0110=> tag, in this case "Model"

0002=> format, in this case 2

00000009=> components, in this case 9

00000080=> data, in this case 0x80 or 128.
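
You can check the hand-decoded entries against `struct`. This is a sketch under the assumptions already made above (big endian, IFD at offset 8 from the first 0x4d); `read_ifd_entries` is a name I made up, and `SAMPLE` is just the hexdump typed back in.

```python
import struct

# First 0xb0 bytes of the image, copied from the hexdump above.
SAMPLE = bytes.fromhex(
    "ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 48 "
    "00 48 00 00 ff e1 04 dc 45 78 69 66 00 00 4d 4d "
    "00 2a 00 00 00 08 00 09 01 0f 00 02 00 00 00 06 "
    "00 00 00 7a 01 10 00 02 00 00 00 09 00 00 00 80 "
    "01 1a 00 05 00 00 00 01 00 00 00 8a 01 1b 00 05 "
    "00 00 00 01 00 00 00 92 01 28 00 03 00 00 00 01 "
    "00 02 00 00 01 31 00 02 00 00 00 06 00 00 00 9a "
    "01 32 00 02 00 00 00 14 00 00 00 a0 87 69 00 04 "
    "00 00 00 01 00 00 00 b4 88 25 00 04 00 00 00 01 "
    "00 00 03 d2 00 00 00 00 41 70 70 6c 65 00 69 50 "
    "68 6f 6e 65 20 35 00 00 00 00 00 48 00 00 00 01"
)


def read_ifd_entries(tiff, ifd_offset, endian=">"):
    """Return a list of (tag, format, components, data) tuples for one IFD."""
    (count,) = struct.unpack(endian + "H", tiff[ifd_offset:ifd_offset + 2])
    entries = []
    for i in range(count):
        start = ifd_offset + 2 + i * 12  # skip the 2-byte count, then 12 bytes per entry
        entries.append(struct.unpack(endian + "HHII", tiff[start:start + 12]))
    return entries


tiff = SAMPLE[0x1E:]  # start at the first 0x4d of "MM"
entries = read_ifd_entries(tiff, 8)
for tag, fmt, components, data in entries[:2]:
    print(hex(tag), fmt, components, hex(data))
# 0x10f 2 6 0x7a
# 0x110 2 9 0x80
```

Unpacking the data field as one 4-byte integer is only meaningful when it holds an offset; it is a simplification for values stored directly in the field.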

Here's how you parse each tag:
*   The tag converts to a string using a dictionary, which I'll provide for you.
*   The format tells you how many bytes each component of this tag takes. Look it up by format number in
bytes_per_component = (0,1,1,2,4,8,1,1,2,4,8,4,8) (also provided)
*   The length (in bytes) of this tag's data is equal to
length = bytes_per_component[format]*components
*   If the length is <= 4, then the data field is the value itself. Otherwise, it's the offset to the data. Here, length = 1*6 = 6, which is > 4, so 0x7a is an offset.
*   Offset from what, you ask? From the first byte of the endianness marker. So to make programming easier, it's best to send the IFD to a function as a bytes object starting from the marker (0x4d / `M` in a big endian Exif).
*   Let's check that: location of 0x4d: 0x1E
*   Add offset: 0x1E+0x7A = 0x98
*   There we find 6 bytes: 41 70 70 6c 65 00 or "Apple"; all strings end with 0x00 (the null byte)
Let's do the second tag:
*   length is bytes_per_component[format]*components = 1*9 = 9
*   9 is greater than 4, and so the data field is an offset and not a value itself.
*   Add offset to location of 0x4d: 0x1E + 0x80 = 0x9E
*   Above we see that's 9 bytes: 69 50 68 6f 6e 65 20 35 00 or "iPhone 5".
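
The steps for both tags can be sketched as one function. Again, this is an illustration under the assumptions above (big endian, offsets counted from the first 0x4d), not the assignment solution; `read_entry_data` is a name I made up, and `SAMPLE` is the hexdump typed back in.

```python
import struct

BYTES_PER_COMPONENT = (0, 1, 1, 2, 4, 8, 1, 1, 2, 4, 8, 4, 8)

# First 0xb0 bytes of the image, copied from the hexdump above.
SAMPLE = bytes.fromhex(
    "ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 48 "
    "00 48 00 00 ff e1 04 dc 45 78 69 66 00 00 4d 4d "
    "00 2a 00 00 00 08 00 09 01 0f 00 02 00 00 00 06 "
    "00 00 00 7a 01 10 00 02 00 00 00 09 00 00 00 80 "
    "01 1a 00 05 00 00 00 01 00 00 00 8a 01 1b 00 05 "
    "00 00 00 01 00 00 00 92 01 28 00 03 00 00 00 01 "
    "00 02 00 00 01 31 00 02 00 00 00 06 00 00 00 9a "
    "01 32 00 02 00 00 00 14 00 00 00 a0 87 69 00 04 "
    "00 00 00 01 00 00 00 b4 88 25 00 04 00 00 00 01 "
    "00 00 03 d2 00 00 00 00 41 70 70 6c 65 00 69 50 "
    "68 6f 6e 65 20 35 00 00 00 00 00 48 00 00 00 01"
)


def read_entry_data(tiff, entry_offset, endian=">"):
    """Return (tag, raw_bytes) for the 12-byte entry at `entry_offset`."""
    tag, fmt, components = struct.unpack(
        endian + "HHI", tiff[entry_offset:entry_offset + 8]
    )
    data_field = tiff[entry_offset + 8:entry_offset + 12]
    length = BYTES_PER_COMPONENT[fmt] * components
    if length <= 4:
        raw = data_field[:length]  # the value is stored in the field itself
    else:
        (data_offset,) = struct.unpack(endian + "I", data_field)
        raw = tiff[data_offset:data_offset + length]  # the field is an offset
    return tag, raw


tiff = SAMPLE[0x1E:]  # start at the first 0x4d of "MM"
first_entry = 8 + 2   # ifd_offset plus the 2-byte entry count

make = read_entry_data(tiff, first_entry)
model = read_entry_data(tiff, first_entry + 12)
print(hex(make[0]), make[1])    # 0x10f b'Apple\x00'
print(hex(model[0]), model[1])  # 0x110 b'iPhone 5\x00'
```

For string tags (format 2) you would then strip the trailing null byte and decode as ASCII, e.g. `make[1].rstrip(b"\x00").decode("ascii")`.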