05: Bit twiddling, file formats, parsing Exif

Welcome

Announcements

Add/drop is over. Buckle up.

Accommodations: If you have DS accommodations, you need to contact me about what you want within the scope of the accommodations, and I need time to set it up.

It appears Gradescope can’t easily be updated to use Python 3.6, so right now it’s on 3.5. I have a request in to see if they can make it straightforward for us to go to 3.6. It shouldn’t be a big deal either way.

Bit twiddling in Python

So, the next assignment will involve you doing some bit twiddling – that is, manipulation of bytes at the bit level.

For those of you who aren’t fresh out of 230, here’s a quick recap on bit twiddling.

The fundamental operations you’ll need are “bitwise and” (in Python, the operator &), “bitwise or” (|), and bit shifting left (<<) and right (>>). (See, for example, https://docs.python.org/3/reference/expressions.html#binary-bitwise-operations).

For example, suppose I wanted to extract the middle four bits (out of one byte) of the number 222, and put them at the front of a byte that ends with four set bits.

destination = 0b1111
source = 222
print(bin(source)) # just to see it, don't actually manipulate strings!
# => 0b11011110

mask = 0b00111100 
# you can write masks (or any values, really!) in hex if you want: 
# 0b00111100 == 0x3c == 60

# extract the middle four bits; all others are set to zero
bits = source & mask
print(bin(bits))
# => 0b11100 # notice the leftmost zeros are elided

# shift them into position; we want them in the front four, so we need to
# move them over two
bits = bits << 2
print(bin(bits))
# => 0b1110000

# now put them into the destination:
destination = destination | bits
print(bin(destination))
# => 0b1111111

Again, you should not be operating on the string representation of the bytes! In other words, don’t call bin() then manipulate the resulting value of type string! You should operate on the underlying byte (which will be of type int). It’s orders of magnitude more efficient and the only reasonable way to do things in real bit twiddling code.
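The example above only shifts left. Here's a minimal sketch of the same idea going the other way, with a right shift and a mask built on the fly. The helper name extract_bits is hypothetical (it's not something the assignment provides); it's just one way to wrap the steps.

```python
def extract_bits(value, shift, width):
    """Return `width` bits of `value`, starting `shift` bits from the right."""
    mask = (1 << width) - 1          # e.g. width 4 -> 0b1111
    return (value >> shift) & mask   # move the field down, then mask it off


source = 222  # 0b11011110

middle = extract_bits(source, 2, 4)       # the middle four bits
high_nibble = extract_bits(source, 4, 4)  # the top four bits
low_nibble = extract_bits(source, 0, 4)   # the bottom four bits

print(bin(middle), bin(high_nibble), bin(low_nibble))
# => 0b111 0b1101 0b1110

# same result as the walkthrough above: middle bits in front of four set bits
print(bin((middle << 4) | 0b1111))
# => 0b1111111
```

Everything here stays an int the whole time; bin() is only used for printing.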

Talk to me or the course assistants if you need more than that to get started.

Parsing JPEGs (to get to EXIFs)

First, let’s look at the JFIF format description on Wikipedia. Notice that it’s structured as a series of segments, each of which has a particular two-byte marker in the format FF XX, where the XX tells you what type of marker it is. Start and end of image are obvious; many of the others are “header” information of various types.

For example, the APP0 header (FFE0) tells you some things about the image (for example, resolution). Also notice that this header can contain an embedded thumbnail (an image within an image), and that’s why we can’t just go to the first FFD9 after an FFD8 when carving – we need to try ‘em all.

How are we gonna deal with this? Some background first.

File formats: Back to the beginning

File formats are a formal (or informal) way to specify how information is arranged on disk (more generally, formats can apply to memory, too).

Formats can be textual (which just means printable ASCII) or binary. To read text formats, we can just open them up (or write a parser). To read binary, we can view them in a hex editor, or we can write a parser.

Textual file formats

HTML and JSON are file formats most of you are familiar with.

HTML separates markup (“tags” – control information, or metadata) from text by enclosing markup in pairs of angle brackets ‘<>’. This is the syntax. The meaning of each tag is its semantics. To extract the tags, you can imagine that we could write a program that would linearly scan through an HTML file, starting in “text” mode. Capture as you go, watching for open brackets, and switching into “tag” mode when one is encountered. Then switch back to “text” mode when you see the close bracket.

Problems? Sure. You need to read the spec to understand about comments. And what about incorrectly formatted (technically invalid) files? That happens all the time with textual formats (since humans are directly editing them). Should you fail or attempt to recover? But you get the idea.
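Here's a rough sketch of that two-mode scan. The function name split_tags is made up for illustration, and it deliberately ignores the problems just mentioned (comments, invalid input), so treat it as a picture of the idea and not a real HTML parser.

```python
def split_tags(html):
    """Linear scan: return (text_chunks, tags). Ignores comments and bad input."""
    mode = "text"
    current = []
    text_chunks, tags = [], []
    for ch in html:
        if mode == "text" and ch == "<":
            # open bracket: save any captured text, switch to tag mode
            if current:
                text_chunks.append("".join(current))
            current = []
            mode = "tag"
        elif mode == "tag" and ch == ">":
            # close bracket: save the tag, switch back to text mode
            tags.append("".join(current))
            current = []
            mode = "text"
        else:
            current.append(ch)
    if mode == "text" and current:
        text_chunks.append("".join(current))
    return text_chunks, tags


print(split_tags("<p>Hello, <b>world</b>!</p>"))
# => (['Hello, ', 'world', '!'], ['p', 'b', '/b', '/p'])
```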

JSON is similar, but more complicated. For example, here’s some code from the autograder specifying part of the assignment.

{
  "type": "single",
  "file": "strings.py",
  "wpo_tests": [
    {
      "score": 5,
      "name": "simple UTF-8 length 2",
      "command": "python3.5 strings.py -n 2 simple-utf8.txt",
      "expected": "simple-utf8.txt.2.expected"
    },
...

But still, you can see the basic structure: curly braces are start-of-dictionary and colons separate keys from values; quotes are start-of-string; integers are literals, square brackets are start-of-list, commas separate items, and so on. This isn’t a compilers class so we won’t go into detail here, but there’s a simple grammar that can be represented in various ways (see http://json.org/ for two: a visual representation and a chart of tokens). Again, there are standard techniques to build general parsers based upon an input grammar, but we’re going to hand-write our parsers / carvers in this class.

Dividing data

In both JSON and HTML (and indeed, in most text-based, easily-human-readable formats) there’s widespread use of delimiters, marker characters or strings that are used to show where one element ends and another begins. A very common older use of this is so-called “NUL terminated strings”, where ASCII (or perhaps UTF-8) data is stored starting in a known location, and continues until it stops. How do you know it stops? There’s a NUL (0x00) byte after it.

NUL-terminated strings come from the land of C and are the source of many potential problems in programs (for example, what if you forget to write the NUL byte? Or what if it gets overwritten in a fixed-size array in C?).
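In Python, reading a NUL-terminated string out of a bytes object might look like the sketch below. The name read_cstring is hypothetical, and raising an error when the terminator is missing is just one possible choice.

```python
def read_cstring(data, start=0):
    """Return the bytes from `start` up to (not including) the next NUL byte."""
    end = data.find(b"\x00", start)
    if end == -1:
        # the "forgot to write the NUL" case: nothing tells us where to stop
        raise ValueError("no NUL terminator found")
    return data[start:end]


buf = b"Apple\x00iPhone 5\x00"
print(read_cstring(buf))     # => b'Apple'
print(read_cstring(buf, 6))  # => b'iPhone 5'
```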

Anyway, markers are generally convenient and readable, but there are other approaches that have other benefits (and drawbacks).

One approach is to use a fixed amount of space for an element. Sometimes this is called using a “fixed-width” field, and in binary formats, width is usually measured in bytes. Essentially, the format hardcodes something into place. Like, “the next four bytes will be an unsigned int representing the street number” or the like. When one or more fixed-width fields are present, the programmer knows exactly how far to seek ahead to access any particular one.

Another approach is to explicitly embed something about the length of a field into another field. Usually called “length” or “size” fields, these fields represent either the length of another field, or the number of records (of one or more fields) that follow (where the latter are usually fixed-size).
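To make both approaches concrete, here's a small sketch using a made-up record layout (not any real format): a four-byte unsigned street number, then a two-byte length field, then that many bytes of street name. It uses Python's struct module to decode the fixed-width part.

```python
import struct

# Hypothetical layout: >I street number, >H name length, then the name bytes.
record = struct.pack(">IH", 140, 5) + b"Maple" + b"whatever comes next"

# Fixed-width fields: we know exactly where they are and how wide they are.
street_number, name_length = struct.unpack_from(">IH", record, 0)
header_size = struct.calcsize(">IH")  # 4 + 2 = 6 bytes

# Length field: it tells us how many bytes of name follow the header.
name = record[header_size:header_size + name_length]

print(street_number, name_length, name)
# => 140 5 b'Maple'
```

Notice there's no delimiter anywhere: the parser never has to scan for an end marker, but it does have to trust the length field.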

Extracting data

So, suppose we have a format where we know that a particular group of bytes represents a value. If it’s text, we know how to parse it (right?). But what if it’s a numeric value? It depends upon how it’s encoded: endianness, sign, and width are the determinants to figure it out. I’m not going to make you write a general value encoder/decoder, but you will need to know about Python’s. It’s the struct module.

import struct

# suppose we have four bytes
some_bytes = b'\xC2\x40\xD1\x2A'

# what numbers are these bytes, as two two-byte unsigned big-endian integers (aka unsigned shorts)?
struct.unpack('>HH', some_bytes)

# the result is a tuple, since unpack can return more than one value.
# => (49728, 53546)

# tuples can be indexed just like lists
struct.unpack('>HH', some_bytes)[0]
# => 49728

# what if we read the bytes as a single unsigned big-endian integer (a four-byte unsigned int)?
struct.unpack('>I', some_bytes)[0]
# => 3259027754

# or a signed little-endian int?
struct.unpack('<i', some_bytes)[0]
# => 718356674

More on JPEG/JFIF

So, we’re looking at the JFIF format description on Wikipedia.

Now, how do we recognize and parse an Exif segment? It starts with FF E1. But the Exif format is a legacy format with tons of cruft. The standard is awful (see: http://www.cipa.jp/std/documents/e/DC-008-Translation-2016-E.pdf) and easy-to-read descriptions of how to parse it are hard to come by.

Here’s one such Exif spec.

And here’s a cheat sheet for IFD formats that will make more sense with some context.

We’re going to do an example together in class today, and go over it again next class in more detail as it’s (a) obviously important forensically and (b) makes a great example of how to parse a binary rather than textual file format.

We’re going to look at this image:

some lovely art

Pull it up in a binary viewer (like, say, hexdump). You could use, for example,

hexdump -Cv FullSizeRender.jpg | head -n 11 > output.txt

to hexdump the file, keep just the first 11 lines, and send them to a file named output.txt. (I’ll hand out hardcopies in class.)

It will look something like:

00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 00 00 48  |......JFIF.....H|
00000010  00 48 00 00 ff e1 04 dc  45 78 69 66 00 00 4d 4d  |.H......Exif..MM|
00000020  00 2a 00 00 00 08 00 09  01 0f 00 02 00 00 00 06  |.*..............|
00000030  00 00 00 7a 01 10 00 02  00 00 00 09 00 00 00 80  |...z............|
00000040  01 1a 00 05 00 00 00 01  00 00 00 8a 01 1b 00 05  |................|
00000050  00 00 00 01 00 00 00 92  01 28 00 03 00 00 00 01  |.........(......|
00000060  00 02 00 00 01 31 00 02  00 00 00 06 00 00 00 9a  |.....1..........|
00000070  01 32 00 02 00 00 00 14  00 00 00 a0 87 69 00 04  |.2...........i..|
00000080  00 00 00 01 00 00 00 b4  88 25 00 04 00 00 00 01  |.........%......|
00000090  00 00 03 d2 00 00 00 00  41 70 70 6c 65 00 69 50  |........Apple.iP|
000000a0  68 6f 6e 65 20 35 00 00  00 00 00 48 00 00 00 01  |hone 5.....H....|
...

Notes:

ffd8 => jpeg header

ffe0 => marker number (length of this field is 2)

0010 => Size, including these two bytes.
           Since 0x0010=16, read in 16-2=14 more bytes to grab the entire APP0 segment
           Location of next marker is:
            = location_of_marker + length_of_marker_number + size
            = 0x2 + 0x2 + 0x10 = 0x14

ffe1=> marker number 

04dc=> size, including these two bytes. 
           0x04dc=1244, and so next 1242 bytes must be read in
           Location of next marker is 0x14+0x02+0x04dc=0x04f2
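The marker-hopping arithmetic above can be sketched in a few lines. This is only a sketch: iter_segments is a hypothetical helper name, the data is just the first 32 bytes of the hexdump, and stopping at FFDA (start of scan, after which compressed image data follows) is my own addition that the notes above don't cover.

```python
import struct

# First 32 bytes of the hexdump above (the real file is much longer).
data = bytes.fromhex(
    "ffd8ffe000104a464946000101000048"
    "00480000ffe104dc4578696600004d4d"
)


def iter_segments(data):
    """Yield (marker, offset, size) for each segment header found in data."""
    offset = 2  # skip the two-byte ffd8 at the start
    while offset + 4 <= len(data):
        marker, size = struct.unpack_from(">HH", data, offset)
        if marker >> 8 != 0xFF:
            break  # not a marker; we're lost, so stop
        yield marker, offset, size
        if marker == 0xFFDA:
            break  # assumption: image data follows, no more size-prefixed segments
        offset += 2 + size  # marker number (2 bytes) + size (includes itself)


segments = list(iter_segments(data))
for marker, offset, size in segments:
    print(hex(marker), hex(offset), size)
# => 0xffe0 0x2 16
# => 0xffe1 0x14 1244
```

The loop stops after ffe1 here only because the next marker (at 0x4f2) is past the end of our 32-byte sample.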

4578696600004d4d002a == b'Exif\x00\x00MM\x00\x2a'
     The first 6 bytes (b'Exif\x00\x00') state that this is in fact the Exif we are looking for.
     If you don't see these bytes, then this segment is not an Exif, and so move on.
     0x4d4d => tells you that the Exif entries are big endian ("MM"; "M" stands for "Motorola")
     0x002a => constant of 42, assuming big endian.
     (If little endian, the byte-order marker would be 0x4949, "II" for "Intel", and the constant would appear as 0x2a00)

00000008 => ifd_offset, the offset in bytes to the Image File Directory (IFD) from 0x4d (the first one in 0x4d4d). How many bytes do we need to skip ahead in order to find the start of the IFD? Well, first let's ask: how many bytes, starting from and including that first 0x4d, have we read in to parse the offset? 2 bytes for 0x4d4d, 2 more for 0x002a, 4 more for the offset value. That's 2+2+4 = 8. So we don't have to skip ahead at all: the IFD starts next.
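Here's a hedged sketch of those header checks in Python. The function name parse_exif_header and its return value are my own choices for illustration; the bytes are the ones from file offset 0x18 through 0x25 in the hexdump.

```python
import struct

# Bytes right after the ffe1 size field: "Exif\0\0", byte order, 42, IFD offset.
payload = bytes.fromhex("457869660000" "4d4d" "002a" "00000008")


def parse_exif_header(payload):
    """Return (endian, ifd_offset); endian is a struct prefix, '>' or '<'."""
    if payload[:6] != b"Exif\x00\x00":
        raise ValueError("not an Exif segment")
    exif = payload[6:]  # all offsets are measured from the byte-order marker
    if exif[:2] == b"MM":
        endian = ">"  # big endian ("Motorola")
    elif exif[:2] == b"II":
        endian = "<"  # little endian ("Intel")
    else:
        raise ValueError("unknown byte-order marker")
    magic, ifd_offset = struct.unpack_from(endian + "HI", exif, 2)
    if magic != 42:
        raise ValueError("missing the constant 42")
    return endian, ifd_offset


print(parse_exif_header(payload))
# => ('>', 8)
```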

0009 => entries, there are 0x9 entries in this IFD. The first starts immediately.
Each entry is 12 bytes long, and the first begins right after this two-byte count.
Therefore, entry i (counting from 0) is at ifd_offset + 2 + i*12, measured from the first 0x4d.
Each entry is tag,format,number_of_components,data like so: 0xttttffffnnnnnnnndddddddd

010f=> tag, in this case "Make"

0002=> format, in this case 2

00000006=> components, in this case 6

0000007a=> data, which is 0x7a or 122. Let's come back to that in a second.

That was 12 bytes. And so now tag 2 starts:

0110=> tag, in this case "Model"

0002=> format, in this case 2

00000009=> components, in this case 9

00000080=> data, in this case 0x80 or 128.

Here's how you parse each tag:
*   The tag converts to a string using a dictionary, which I'll provide for you.
*   The format tells you how many bytes_per_component for this tag. This is defined as
bytes_per_component = (0,1,1,2,4,8,1,1,2,4,8,4,8) (also provided)
*   The length (in bytes) of this tag's data is equal to
length = bytes_per_component[format]*components
*   If the length <=4, then the data field is the value. Otherwise, it's the offset to the data. And here, length = 1*6 = 6, which is > 4.
*   Offset from what, you ask? From the first byte of the endianness marker. So to make programming easier, it's best to send the IFD to a function as a bytes object starting from the marker (0x4d / `M` in a big endian Exif).
*   Let's check that: location of 0x4d: 0x1E
*   Add offset: 0x1E+0x7A = 0x98
*   There we find 6 bytes: 41 70 70 6c 65 00, or "Apple"; all strings end with 0x00 (the NUL byte)
Let's do the second tag:
*   length is bytes_per_component[format]*components = 1*9 = 9
*   9 is greater than 4, and so the data field is an offset and not a value itself.
*   Add offset to location of 0x4d: 0x1E + 0x80 = 0x9E
*   Above we see that's 9 bytes: 69 50 68 6f 6e 65 20 35 00 or "iPhone 5".
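Putting the two lookups above into code, here's a minimal sketch that runs against the 176 bytes of the hexdump. The helper name read_entry is hypothetical, big-endian is hardcoded because this file says MM, and the handling of short values (length <= 4, taken from the front of the data field) is an assumption that this sample never exercises.

```python
import struct

# The 11 hexdump lines shown above, as one bytes object (176 bytes).
DUMP = bytes.fromhex(
    "ffd8ffe000104a464946000101000048"
    "00480000ffe104dc4578696600004d4d"
    "002a000000080009010f000200000006"
    "0000007a011000020000000900000080"
    "011a0005000000010000008a011b0005"
    "00000001000000920128000300000001"
    "0002000001310002000000060000009a"
    "0132000200000014000000a087690004"
    "00000001000000b48825000400000001"
    "000003d2000000004170706c65006950"
    "686f6e65203500000000004800000001"
)
BYTES_PER_COMPONENT = (0, 1, 1, 2, 4, 8, 1, 1, 2, 4, 8, 4, 8)

exif = DUMP[0x1E:]  # start at the first 0x4d, so offsets need no adjustment
(ifd_offset,) = struct.unpack_from(">I", exif, 4)
(entry_count,) = struct.unpack_from(">H", exif, ifd_offset)


def read_entry(exif, ifd_offset, i):
    """Return (tag, format, raw data bytes) for entry i, counting from 0."""
    start = ifd_offset + 2 + i * 12
    tag, fmt, components = struct.unpack_from(">HHI", exif, start)
    length = BYTES_PER_COMPONENT[fmt] * components
    if length <= 4:
        # assumption: the value sits at the front of the 4-byte data field
        raw = exif[start + 8:start + 8 + length]
    else:
        (offset,) = struct.unpack_from(">I", exif, start + 8)
        raw = exif[offset:offset + length]
    return tag, fmt, raw


print(entry_count)  # => 9
for i in range(2):
    tag, fmt, raw = read_entry(exif, ifd_offset, i)
    print(hex(tag), fmt, raw)
# => 0x10f 2 b'Apple\x00'
# => 0x110 2 b'iPhone 5\x00'
```

Only the first two entries are read here, since the data for most of the later ones lies beyond the 11 lines we dumped.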

More on this next class.