06: More on File Formats

Announcements

If you're auditing (unofficially or otherwise) and you don't want to be dropped from Gradescope and/or Piazza, please let me know. We're going to purge the rosters soon.

Gaming the autograder:

def decode(bytes_object):
    # brute force: try every candidate input until encode() reproduces
    # the target bytes (returns None if nothing matches)
    for i in range(184321):
        if encode(i) == bytes_object:
            return i

UTF-16 parsing

UTF-8 (ASCII subset) parsing: remember our state machine? Do that.

UTF-16 parsing is the same! Read a pair of bytes. UTF-16 decode it and check if it's ASCII (or, just check if one byte is 0x00 and the other is ASCII).
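A minimal sketch of that pairwise check (assuming UTF-16BE input; the function name is mine):

```python
def utf16_ascii_chars(data):
    """Yield printable-ASCII characters found in UTF-16BE byte pairs.

    In big-endian UTF-16, an ASCII character is a 0x00 high byte
    followed by a byte in the printable range 0x20-0x7E."""
    for i in range(0, len(data) - 1, 2):
        hi, lo = data[i], data[i + 1]
        if hi == 0x00 and 0x20 <= lo <= 0x7E:
            yield chr(lo)

sample = 'Hi!'.encode('utf-16-be')   # b'\x00H\x00i\x00!'
print(''.join(utf16_ascii_chars(sample)))
# => Hi!
```

For little-endian UTF-16 you'd just swap which byte of the pair you expect to be 0x00.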

File formats: Back to the beginning

So. Maybe last Tuesday's jump to Exif parsing was a little too fast. Let's spend a few minutes more on some background.

File formats are a formal (or informal) way to specify how information is arranged on disk (more generally, formats can apply to memory, too).

Formats can be textual (which just means printable ASCII) or binary. To read text formats, we can just open them up (or write a parser). To read binary, we can view them in a hex editor, or we can write a parser.

Textual file formats

HTML and JSON are file formats most of you are familiar with.

HTML separates markup ("tags" -- control information, or metadata) from text by enclosing markup in pairs of angle brackets '<>'. This is the syntax. The meaning of each tag is its semantics. To extract the tags, you can imagine that we could write a program that would linearly scan through an HTML file, starting in "text" mode. Capture as you go, watching for open brackets, and switching into "tag" mode when one is encountered. Then switch back to "text" mode when you see the close bracket.
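The linear scan described above can be sketched as a little two-mode loop (this is my toy version, not a real HTML parser):

```python
def split_tags(html):
    """Linear scan: collect text runs and tag contents separately.

    Starts in "text" mode; '<' switches to "tag" mode, '>' switches back.
    (A real parser must also handle comments, quoted attributes, etc.)"""
    texts, tags = [], []
    current, in_tag = [], False
    for ch in html:
        if not in_tag and ch == '<':
            texts.append(''.join(current))
            current, in_tag = [], True
        elif in_tag and ch == '>':
            tags.append(''.join(current))
            current, in_tag = [], False
        else:
            current.append(ch)
    texts.append(''.join(current))
    return texts, tags

texts, tags = split_tags('<p>hello <b>world</b></p>')
print(tags)
# => ['p', 'b', '/b', '/p']
```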

Problems? Sure. You need to read the spec to understand comments. And what about incorrectly formatted (technically invalid) files? That happens all the time with textual formats (since humans edit them directly). Should you fail, or attempt to recover? But you get the idea.

JSON is similar, but more complicated. For example, here's some code from the autograder specifying part of the assignment.

{
  "type": "single",
  "file": "strings.py",
  "wpo_tests": [
    {
      "score": 5,
      "name": "simple UTF-8 length 2",
      "command": "python3.5 strings.py -n 2 simple-utf8.txt",
      "expected": "simple-utf8.txt.2.expected"
    },
...

But still, you can see the basic structure: curly braces mark start-of-dictionary and colons separate keys from values; quotes mark start-of-string; integers are literals; square brackets mark start-of-list; commas separate items; and so on. This isn't a compilers class, so we won't go into detail here, but there's a simple grammar that can be represented in various ways (see http://json.org/ for two: a visual representation and a chart of tokens). Again, there are standard techniques to build general parsers from an input grammar, but we're going to hand-write our parsers / carvers in this class.
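Python's built-in json module will of course parse this sort of thing for you. Here's a quick check on a small fragment in the same shape (this fragment is mine, not the real autograder config):

```python
import json

# a self-contained fragment shaped like the autograder config above
config = json.loads('''
{
  "type": "single",
  "file": "strings.py",
  "wpo_tests": [
    {"score": 5, "name": "simple UTF-8 length 2"}
  ]
}
''')
print(config["wpo_tests"][0]["score"])
# => 5
```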

Dividing data

In both JSON and HTML (and indeed, in most text-based, easily human-readable formats) there's widespread use of delimiters: marker characters or strings that show where one element ends and another begins. A very common older example is the so-called "NUL-terminated string", where ASCII (or perhaps UTF-8) data is stored starting at a known location and continues until it stops. How do you know it stops? There's a NUL (0x00) byte after it.

hello = 'hello world!'
hello.encode()
# => b'hello world!'
len(hello.encode())
# => 12
hello.encode() + b'\x00'
# => b'hello world!\x00'
len(hello.encode() + b'\x00')
# => 13

NUL-terminated strings come from the land of C and are the source of many potential problems in programs (for example, what if you forget to write the NUL byte? Or what if it gets overwritten in a fixed-size array in C?).
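Reading one back means scanning for the terminator. A sketch (the function name is mine):

```python
def read_c_string(data, start=0):
    """Decode a NUL-terminated string beginning at `start` in `data`.

    bytes.index raises ValueError if no NUL terminator is found --
    exactly the "forgot to write the NUL byte" failure mode."""
    end = data.index(b'\x00', start)
    return data[start:end].decode()

buf = b'hello world!\x00junk after the terminator'
print(read_c_string(buf))
# => hello world!
```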

Anyway, markers are generally convenient and readable, but there are other approaches that have other benefits (and drawbacks).

One approach is to use a fixed amount of space for an element. Sometimes this is called using a "fixed-width" field, and in binary formats, width is usually measured in bytes. Essentially, the format hardcodes something into place. Like, "the next four bytes will be an unsigned int representing the street number" or the like. When one or more fixed-width fields are present, the programmer knows exactly how far to seek ahead to access any particular one.
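For instance (a sketch using Python's int.from_bytes; the "street number" record layout is invented for illustration):

```python
# a made-up record: a four-byte big-endian unsigned "street number"
# followed by other data
record = bytes([0x00, 0x00, 0x01, 0xA4]) + b'Main St\x00'

# fixed width means we know exactly where to look: bytes 0 through 3
street_number = int.from_bytes(record[0:4], 'big')
print(street_number)
# => 420
```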

Another approach is to explicitly embed something about the length of a field into another field. Usually called "length" or "size" fields, these fields represent either the length of another field, or the number of records (of one or more field) that follow (where the latter are usually fixed-size).
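A tiny sketch combining a size field with the data it describes (a two-byte big-endian length, then that many bytes; this format is my own invention):

```python
import struct

def read_length_prefixed(data, offset=0):
    """Read a field stored as: 2-byte big-endian length, then the bytes."""
    (length,) = struct.unpack_from('>H', data, offset)
    start = offset + 2
    return data[start:start + length]

record = struct.pack('>H', 5) + b'Apple' + b'trailing junk'
print(read_length_prefixed(record))
# => b'Apple'
```

Note that unlike a delimiter, the length field tells the parser exactly how far to read, so the data itself can contain any byte value.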

Extracting data

So, suppose we have a format where we know that a particular group of bytes represents a value. If it's text, we know how to parse it (right?). But what if it's a numeric value? It depends upon how it's encoded: endianness, sign, and width are the determinants to figure it out. I'm not going to make you write a general value encoder/decoder, but you will need to know about Python's. It's the struct module.

import struct

# suppose we have four bytes
some_bytes = b'\xC2\x40\xD1\x2A'

# what number are these bytes, as two two-byte unsigned big-endian integers (aka unsigned shorts)?
struct.unpack('>HH', some_bytes)

# the result is a tuple, since unpack can return more than one value.
# => (49728, 53546)

# tuples can be indexed just like lists
struct.unpack('>HH', some_bytes)[0]
# => 49728

# what if we read the bytes as a single unsigned big-endian integer (a four-byte unsigned int)?
struct.unpack('>I', some_bytes)[0]
# => 3259027754

# or a signed little-endian int?
struct.unpack('<i', some_bytes)[0]
# => 718356674

Back to JPEG / JFIF / EXIF

JFIF file structure:

Segment     Code/Marker                        Description
SOI         FF D8                              Start of Image
JFIF-APP0   FF E0 s1 s2 4A 46 49 46 00 ...     tag, size, data...
JFXX-APP0   FF E0 s1 s2 4A 46 58 58 00 ...     tag, size, data...
...optionally more segments...
SOS         FF DA                              Start of Scan
...compressed image data...
EOI         FF D9                              End of Image

Here we see examples of markers (the start of segment codes) as well as size fields.

Now, note that FFD8 and FFD9 are special: they mark the start and end of the JFIF (JPEG) file.

Aside: So in theory we can parse any container format (a Word file, a disk image, etc.), looking for pairs of these byte patterns to extract JPEGs. It's not quite that easy. The compressed image data will never contain FFD9 (or indeed, any marker; FF bytes are byte-stuffed so they are always followed by 00). But some of the metadata fields might contain FFD9, so when carving, we potentially have to carve between each (FFD8, FFD9) pair that occurs in that order. (Most JFIF decoders will ignore trailing junk after the FFD9, so getting one "too late" won't hurt.)
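Here's a sketch of that pairing logic (names are mine; a real carver would also bound the search and validate each candidate):

```python
def carve_jpeg_candidates(data):
    """Yield every slice bounded by an (FFD8, FFD9) pair in order.

    Each FFD8 is paired with every later FFD9, since an EOI-looking
    byte sequence can also appear inside metadata fields."""
    starts, ends = [], []
    i = data.find(b'\xff\xd8')
    while i != -1:
        starts.append(i)
        i = data.find(b'\xff\xd8', i + 2)
    i = data.find(b'\xff\xd9')
    while i != -1:
        ends.append(i)
        i = data.find(b'\xff\xd9', i + 2)
    for s in starts:
        for e in ends:
            if e > s:
                yield data[s:e + 2]   # include the 2-byte EOI marker

blob = b'junk\xff\xd8fake-jpeg-body\xff\xd9more junk'
print(list(carve_jpeg_candidates(blob)))
# => [b'\xff\xd8fake-jpeg-body\xff\xd9']
```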

Back to Exif.

EXIF is stored in segments that are marked with FF E1. Then the format looks like:

FF E1   SS SS    45 78 69 66 00 00  4d 4d    00 2a
marker  size(BE) Exif\x00\x00       endian   magic value 42

Let's turn to an example. Here's the top of our file FullSizeRender.jpg in hexdump:

00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 00 00 48  |......JFIF.....H|
00000010  00 48 00 00 ff e1 04 dc  45 78 69 66 00 00 4d 4d  |.H......Exif..MM|
00000020  00 2a 00 00 00 08 00 09  01 0f 00 02 00 00 00 06  |.*..............|
00000030  00 00 00 7a 01 10 00 02  00 00 00 09 00 00 00 80  |...z............|
00000040  01 1a 00 05 00 00 00 01  00 00 00 8a 01 1b 00 05  |................|
00000050  00 00 00 01 00 00 00 92  01 28 00 03 00 00 00 01  |.........(......|
00000060  00 02 00 00 01 31 00 02  00 00 00 06 00 00 00 9a  |.....1..........|
00000070  01 32 00 02 00 00 00 14  00 00 00 a0 87 69 00 04  |.2...........i..|
00000080  00 00 00 01 00 00 00 b4  88 25 00 04 00 00 00 01  |.........%......|
00000090  00 00 03 d2 00 00 00 00  41 70 70 6c 65 00 69 50  |........Apple.iP|
000000a0  68 6f 6e 65 20 35 00 00  00 00 00 48 00 00 00 01  |hone 5.....H....|
...

Let's look at this in Python:

import struct

all_bytes = open('FullSizeRender.jpg', 'rb').read()

print(all_bytes[0:2]) # the start of image marker
# b'\xff\xd8'

print(all_bytes[2:4]) # the APP0 entry marker, starting at byte 2 (inclusive) and going to byte 4 (exclusive)
# b'\xff\xe0'

print(all_bytes[4:6]) # the APP0 entry's size
# b'\x00\x10'

print(struct.unpack('>H', all_bytes[4:6])[0]) # the size as a number
# 16

print(all_bytes[4 + 16: 4 + 16 + 2]) # the size counts from the size field itself, so the next entry's marker starts at 4 (where the size field begins) + 16 (the size), and is two bytes long:
# b'\xff\xe1'

# let's carve out just this part of the file:
exif_bytes = all_bytes[4 + 16:]

# now byte "0" in exif_bytes is the start of the exif entry
print(exif_bytes[0:2])
# b'\xff\xe1'

# how big is it?
print(struct.unpack('>H', exif_bytes[2:4])[0])
# 1244

# the next six bytes are the exif tag:
print(exif_bytes[4:10])
# b'Exif\x00\x00'

# then the endian marker:
print(exif_bytes[10:12])
# b'MM'

# then the magic value 42, encoded as a two-byte value (to check endianness, I guess):
print(struct.unpack('>H', exif_bytes[12:14])[0])
# 42

Then comes the offset to the first IFD, measured from the first byte of the endian marker and stored in 4 bytes:

?? ?? ?? ??

It almost always equals 0x00000008, which means "immediately following this value."

Note that every other offset after this point is from the endian marker, so it's not a bad idea to slice the bytes array again here.

print(exif_bytes[14:18])
# b'\x00\x00\x00\x08'
bom_bytes = exif_bytes[10:]

# what's the offset to the IFD, measured from the start of bom_bytes?
# (the offset field itself lives at exif_bytes[14:18], i.e. bom_bytes[4:8])
ifd_start = struct.unpack('>I', exif_bytes[14:18])[0]

Then comes the IFD. The IFD looks like:

EE EE       -- two bytes, number of entries
entries, fixed size
LL LL LL LL -- four bytes, offset to next IFD

# how many entries?
print(struct.unpack('>H', bom_bytes[ifd_start:ifd_start+2])[0])
# 9
# the entries start immediately thereafter; each is 12 bytes long

Why might there be space until the next IFD? Because entries are fixed size and might need to hold variable-sized data. So they'll "point" into the space between IFDs where variably-sized data can be stored.

An entry looks like:

TT TT ff ff NN NN NN NN DD DD DD DD

where:

  • T is the tag number (I'll give you this when you program; it's in a table)
  • f is the format code, or "type" (it tells you what type of data is being stored in this entry)
  • N is the "number of components" -- the number of entries of type f being stored; the total size of the data is the sizeof(type(f)) * N
  • D is either the data (if it would fit in four bytes) or the offset to the data (if not; remember, offsets are from the first endianness marker byte)

Let's parse the first one:

# first, the tag number (the first entry starts right after the
# two-byte entry count, at ifd_start + 2)
print(bom_bytes[ifd_start+2:ifd_start+4])
# b'\x01\x0f'
# so we look this up in the table (see reading for the Exif spec if you want) and see it's the "Make" field

# next, the format code, or type
print(struct.unpack('>H', bom_bytes[ifd_start+4:ifd_start+6])[0])
# 2, which is ASCII; again see the spec or the table I'll provide

# next, the "number of components"
print(struct.unpack('>I', bom_bytes[ifd_start+6:ifd_start+10])[0])
# 6

# ASCII entries are 1 byte per component (character), and there are six of them,
# so it won't fit in a 4-byte data field. Therefore the data field is an offset from 
# the endian marker:
print(struct.unpack('>I', bom_bytes[ifd_start+10:ifd_start+14])[0])
# 122

# so the value of the "Make" tag is 122 bytes from the endian marker, and is a six-byte long ASCII string (NUL terminated):
print(bom_bytes[122:122 + 6])
# b'Apple\x00'

# we can trim the nul and convert to a Python string:
make = bom_bytes[122:122 + 6]
make[0:-1].decode() # note "0:-1" means "from first to last-but-one"
# 'Apple'
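Putting the pieces together, the entry walkthrough above can be bundled into a small helper. This is a sketch against made-up input bytes, not the full assignment solution: it assumes big-endian data and handles only the ASCII type (code 2), whose components are one byte each:

```python
import struct

def parse_ifd_entry(bom_bytes, entry_offset):
    """Parse one 12-byte IFD entry; return (tag, value) for ASCII entries.

    Offsets stored in the data field are measured from the endian
    marker, i.e. from the start of bom_bytes."""
    tag, fmt, count = struct.unpack_from('>HHI', bom_bytes, entry_offset)
    if fmt != 2:
        raise NotImplementedError('only ASCII (type 2) handled here')
    if count <= 4:   # the data fits inline in the entry's data field
        raw = bom_bytes[entry_offset + 8:entry_offset + 8 + count]
    else:            # the data field holds an offset instead
        (offset,) = struct.unpack_from('>I', bom_bytes, entry_offset + 8)
        raw = bom_bytes[offset:offset + count]
    return tag, raw[:-1].decode()   # trim the trailing NUL

# a made-up bom_bytes: one entry at offset 0, data stored at offset 12
fake = struct.pack('>HHII', 0x010F, 2, 6, 12) + b'Apple\x00'
print(parse_ifd_entry(fake, 0))
# => (271, 'Apple')
```

A fuller version would dispatch on all the type codes (and their per-component sizes) from the table I'll provide.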