05: JPEG and Exif

Estimated time to complete: four hours (or less, if you are experienced with Python and parsing binary file formats)

JPEG (really: JFIF) is a widely-used lossy compression format for images. JPEG files often have interesting embedded metadata in the Exif format. In this assignment, you’ll write a program that can carve JPEG files out of a larger document or disk image. Then, you’ll write a program to parse (a subset of) the Exif metadata format. Carving and parsing are two common forensic tasks, and learning to do so on JPEG/Exif is more than just an academic exercise – it also validates that the standards work the way they are documented.

Carving JPEGs

As we discussed in lecture, the JPEG File Interchange Format is the container format in which JPEG images are stored. This format is how “.jpg” files are stored on disk. JFIF has a well-defined file structure that is relatively straightforward to recognize and parse.

Many other document formats, such as Microsoft Word .doc and .docx files, or Adobe’s Portable Document Format (.pdfs), embed external media directly into their files; you can linearly scan through these documents to find JFIF data. These JPEGs might or might not be accessible in the document; they might live in “slack space” if the document format entails slack. Similarly, you can scan through disk images to extract some JPEGs (though not all, as we’ll see later in the semester), and again, files that have been deleted might still be accessible as raw data.

To a first approximation, you can carve JFIF files in a straightforward, greedy way. In particular, you can:

First, linearly scan through a file (viewed as bytes) looking for and tracking the offsets of byte sequences that correspond to the “Start of Image” (FF D8) and “End of Image” (FF D9).
Then, for each (SOI, EOI) pair that “makes sense,” extract the bytes from the beginning of the SOI bytes to the end of the EOI bytes. What does it mean for a pair to “make sense”?
- The SOI should come before EOI in the file.
- (optionally) The number of bytes being extracted should be less than or equal to some pre-specified cutoff.

Why is this an approximation? Because the SOI/EOI markers are just data: they might appear anywhere in a file, including a non-JPEG file. For a given SOI marker, there may be several sensical EOI markers that correspond to it.
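The greedy pairing described above can be sketched roughly as follows. This is only an illustration working on an in-memory bytes object, not the required find_jfif() implementation: the helper names find_all and candidate_pairs are hypothetical, and the choice to report the offset of the EOI's last byte as the end of each pair is an assumption.

```python
SOI = b"\xff\xd8"  # Start of Image marker
EOI = b"\xff\xd9"  # End of Image marker


def find_all(data, needle):
    """Return every offset at which needle occurs in data (hypothetical helper)."""
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets


def candidate_pairs(data, max_length=None):
    """Greedily pair each SOI with every later EOI (hypothetical helper).

    Assumption: a pair is (offset of the SOI's first byte,
    offset of the EOI's last byte), so both ends are inclusive.
    """
    eois = find_all(data, EOI)
    pairs = []
    for start in find_all(data, SOI):
        for eoi in eois:
            if eoi < start:
                continue  # the SOI must come before the EOI
            end = eoi + 1  # the EOI marker is two bytes long
            length = end - start + 1
            if max_length is not None and length > max_length:
                continue  # longer than the pre-specified cutoff
            pairs.append((start, end))
    return pairs


# One SOI at offset 2, followed by two EOIs at offsets 6 and 10.
data = b"xx\xff\xd8AB\xff\xd9yy\xff\xd9"
print(candidate_pairs(data))                # [(2, 7), (2, 11)]
print(candidate_pairs(data, max_length=6))  # [(2, 7)]
```

Note how the single SOI yields two candidate pairs: nothing in the marker bytes alone says which EOI is the "real" one.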

What to do

To start, implement a carve() function that, given a file-like object and start and end (inclusive) offsets, returns a bytes sequence containing the bytes between those offsets:

def carve(f, start, end):
    # return the bytes

    # here is an example that just returns the entire range of bytes:
    return f.read()

If you re-use the same file-like object, you will find that as you read() through it to the end, you can’t re-use it without re-opening it. Or can’t you? You can do the right thing, which is to understand that in the standard UNIXy abstraction for files, there is a way to adjust the current position within the file. This is called “seeking.” The io module documentation on seek() explains the details, but in short, you can use f.seek(n) to seek to the nth byte of the file f, starting from the beginning of the file. Use f.seek(0) to jump back to the start of the file.

This is called seeking to an absolute position; you can also seek relative to the current position or the end of the file; again, read the docs if you want to do so.
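Here is a small demonstration of seeking, using an in-memory io.BytesIO object as a stand-in for a file opened in 'rb' mode (the ten-byte contents are made up for the example):

```python
import io

f = io.BytesIO(b"0123456789")  # stand-in for open(filename, 'rb')

first = f.read()   # reads everything; the position is now at the end
empty = f.read()   # nothing is left to read, so this is b""

f.seek(0)          # absolute seek back to the start of the file
again = f.read()   # the whole contents once more

f.seek(3)          # absolute seek to offset 3
middle = f.read(4) # reads the 4 bytes at offsets 3, 4, 5 and 6

f.seek(-2, io.SEEK_END)  # seek relative to the end of the file
tail = f.read()          # the last two bytes

print(first, empty, again, middle, tail)
```

Keep in mind that read(n) takes a count of bytes, not an end offset, so an inclusive range from start to end covers end - start + 1 bytes.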

(We won’t actually use the carve() function to carve the bytes, though doing so and then writing them to a file is fairly straightforward. Open a file in wb mode and write() them.)
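If you do want to save carved bytes, a minimal sketch looks like this. The byte string, the temporary directory, and the file name carved_0.jpg are placeholders chosen for the example:

```python
import os
import tempfile

# Stand-in for bytes returned by carve(); not a real image.
carved = b"\xff\xd8example\xff\xd9"

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.mkdtemp(), "carved_0.jpg")

with open(out_path, "wb") as out:  # 'wb' = write, binary mode
    out.write(carved)

# Read the file back to confirm the bytes survived unchanged.
with open(out_path, "rb") as check:
    round_trip = check.read()
```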

Next, implement a find_jfif() function which, when passed a binary file-like object (as returned by open(filename, 'rb')) and an optional maximum length, returns a sequence of pairs. Each pair in the sorted sequence should represent a pair of offsets into the file, to a sensical SOI/EOI pair. Your function should look something like:

def find_jfif(f, max_length=None):
    # do some stuff

    # then return a possibly-empty sequence of pairs

    # here's an example that just returns the start and end of the file without parsing
    chunk = f.read()
    last_byte = len(chunk)
    return [(0, last_byte)]

Try to get this working before moving on.

590F students: As mentioned above, SOI/EOI markers are just data and might appear anywhere in a file. You can do a little more parsing to see if the data range is likely to actually be in JFIF format.

In particular, you can verify that the JFIF format is followed within the range of bytes. Each SOI should immediately be followed by one or more correctly-formatted segments, then a single SOS (Start of Scan) marker, image data, and an EOI. You don’t need to parse each segment completely (and in fact should not, as the autograder spoofs some of the data), but you should check that each segment starts with a valid tag, then the segment’s length, and then make sure that another segment (or the SOS and image data) starts immediately thereafter, etc. Almost any 0xFF?? value is valid as a segment tag, but there are four exceptions: 0xFFD8 and 0xFFD9 are not valid, since they are the SOI/EOI tags. But 0xFF00 and 0xFFFF are also invalid, per the specification.
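One way to picture the segment walk is the rough sketch below. It is an illustration under stated assumptions rather than a reference solution: the function name segments_look_valid is hypothetical, it works on a bytes object that already spans one SOI/EOI candidate, and it stops as soon as it reaches the SOS tag (0xFFDA) without looking at the image data. It relies on the fact that each segment's two-byte big-endian length counts the length field itself but not the tag.

```python
import struct

INVALID_TAGS = {0xFFD8, 0xFFD9, 0xFF00, 0xFFFF}  # the four exceptions
SOS = 0xFFDA  # Start of Scan tag


def segments_look_valid(data):
    """Check tag/length chaining between SOI and SOS (hypothetical helper).

    data is assumed to start with the SOI bytes and end with the EOI bytes.
    """
    if data[:2] != b"\xff\xd8" or data[-2:] != b"\xff\xd9":
        return False
    pos = 2  # first byte after the SOI
    seen_segment = False
    while pos + 4 <= len(data):
        # Tag and length are both two-byte big-endian values.
        tag, length = struct.unpack(">HH", data[pos:pos + 4])
        if tag >> 8 != 0xFF or tag in INVALID_TAGS:
            return False
        if tag == SOS:
            # Require at least one ordinary segment before the SOS.
            return seen_segment
        if length < 2:
            return False  # the length field counts its own two bytes
        pos += 2 + length  # skip the tag plus the rest of the segment
        seen_segment = True
    return False  # ran out of bytes without reaching an SOS


good = (b"\xff\xd8" + b"\xff\xe0\x00\x04ab" +
        b"\xff\xda\x00\x02" + b"img" + b"\xff\xd9")
print(segments_look_valid(good))
```

A wrong length in any segment makes the next read land on bytes that are not a valid tag, which is how spoofed SOI/EOI pairs tend to get rejected.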

(As a side note, there are additional checks you can do. Within the image data only, FF bytes will almost always be followed by 00 bytes (see https://en.wikipedia.org/wiki/JPEG#Syntax_and_structure); if not, it’s not a valid JPEG! The only FF XX bytes you will see when scanning the image data will be FF D9 – the EOI marker. There are exceptions to this rule (read the JPEG spec if you’re curious) but they’re more nitpicky than they are worth for this assignment. Just checking segment markers and sizes as described in the previous paragraph gets you 90% of the way there and is the only check you should do. Do not try to detect correctly byte-stuffed image data as described in this parenthesized paragraph! Just parse the segments as described in the previous paragraph.)

590F: modify your find_jfif() function to take an additional boolean argument parse (defaulting to False) that determines if it performs this additional parsing of the segment markers or not, that is:

def find_jfif(f, max_length=None, parse=False):
    # ...

will return fewer, but more likely-to-be-valid, offsets to JFIFs if parse is set to True.

Parsing Exif

So now you’ve got a way to extract JFIF files. What can you do with them once you’ve got them? Find and parse the Exif data (if it exists), of course!

Generally, parsing Exif will follow the process we did manually in class. First, check if the file starts with an SOI marker. Then, read through the segments until you find the first Exif segment. Parse this segment:

Determine whether it’s big-endian or little-endian; handle multibyte data after this point appropriately.
Find the start of the IFD.
Determine the number of entries.
For each entry:
- Determine its name from its tag. We provide a file tags.py containing the list of names you must handle; if the tag is not in this file, skip the entire entry. You’ll need to import this file to access its one defined variable.
- If its type (format) is one of 1, 2, 3, 4, 5, or 7, then parse out its value(s), which may be stored in the data field, or at an offset pointed to by the data field. If the value is a single number, store it as such; if it is more than one number, store it as a list of numbers; if it is textual, store the parsed text (that is, as a native Python string – strip the trailing NUL (00) byte and decode it as UTF-8); if it is undefined (raw), store it as a string in hexadecimal format; if it is a rational number, store it as a string of the form numerator/denominator, with the actual values in place of those words.
  
  For some types, you need only parse and store the first value; for some you’ll need to store several (like text). We provide a format table, which is a cheat sheet derived from the Exif specification about each possible format, how to use struct.unpack to decode its values, and whether we want you to get the first or all components.
If there is another IFD, repeat. (Usually the second IFD is for the thumbnail, if any, embedded in the JFIF.)
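The first few steps above (byte order, start of the IFD, entry count, and walking the entries) might look roughly like the sketch below. It assumes you have already located the Exif segment and sliced out the bytes that begin at the TIFF header (the "MM" or "II" byte-order mark); the function name read_ifd_entries is hypothetical, values are left undecoded, and it raises ValueError where your solution would raise ExifParseError.

```python
import struct


def read_ifd_entries(tiff):
    """Return (endian, entries) for every IFD in tiff (hypothetical helper).

    tiff: bytes starting at the TIFF header inside the Exif segment.
    Each entry is a (tag, format, components, raw_data_field) tuple.
    """
    order = tiff[:2]
    if order == b"MM":
        endian = ">"  # big-endian ("Motorola")
    elif order == b"II":
        endian = "<"  # little-endian ("Intel")
    else:
        raise ValueError("unknown byte-order mark")

    # Two-byte magic number (42), then a four-byte offset to the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HL", tiff[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")

    entries = []
    while ifd_offset != 0:  # an offset of 0 means there is no further IFD
        (count,) = struct.unpack(endian + "H", tiff[ifd_offset:ifd_offset + 2])
        pos = ifd_offset + 2
        for _ in range(count):
            # Each entry is 12 bytes: tag, format, component count, data field.
            tag, fmt, components, data = struct.unpack(
                endian + "HHL4s", tiff[pos:pos + 12])
            entries.append((tag, fmt, components, data))
            pos += 12
        # The four bytes after the last entry give the next IFD's offset.
        (ifd_offset,) = struct.unpack(endian + "L", tiff[pos:pos + 4])
    return endian, entries


# A hand-built big-endian example with one entry: tag 0x010F (Make),
# format 2 (text), 6 components, stored at offset 26.
tiff = (b"MM\x00\x2a\x00\x00\x00\x08" + b"\x00\x01" +
        b"\x01\x0f\x00\x02\x00\x00\x00\x06\x00\x00\x00\x1a" +
        b"\x00\x00\x00\x00" + b"Apple\x00")
print(read_ifd_entries(tiff))
```

All offsets inside the Exif data are relative to the start of the TIFF header, which is why the sketch slices tiff directly with them. A robust parser would also guard against IFD offsets that point backwards and loop forever.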

Store and return the parsed entries in a Python dictionary:

Each key in the dictionary should be a string, taken from the list of valid tags in tags.py, corresponding to an entry in the IFD(s). If the tag is not in tags.py, then there should be no entry in the dictionary.
Each value in the dictionary is a list. Each time you see an IFD tag, you should append the corresponding parsed value to the list in the dictionary. (You can look at the tests below to see the expected format.)
If the type (format) of the field is in the list of formats we asked you to parse, the value associated with this key should be the parsed value: it will either be a number, a list of numbers, or a string. If the type is not in the list of formats we asked you to parse, set it to None (that is, the Python None type, not the string "None").
If the file was unparseable, that is, if it was not a valid JFIF, or if there was no Exif data, then raise an ExifParseError. (See https://docs.python.org/3.5/tutorial/errors.html#user-defined-exceptions and the starter code.)
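The value rules and the dictionary-of-lists layout described above can be sketched as follows. Treat this as a loose illustration: decode_value and add_entry are hypothetical helpers, raw is assumed to already hold the value bytes (whether they came from the data field or from an offset), the numeric formats decode only the first component here (check the format table for what is actually required), and the exact hexadecimal style for undefined data is an assumption.

```python
import struct


def decode_value(fmt, raw, endian):
    """Decode raw value bytes for one entry (hypothetical helper)."""
    if fmt == 2:  # text: strip the trailing NUL, decode as UTF-8
        if raw.endswith(b"\x00"):
            raw = raw[:-1]
        return raw.decode("utf-8")
    if fmt == 5:  # rational: two four-byte unsigned integers
        numerator, denominator = struct.unpack(endian + "LL", raw[:8])
        return "{}/{}".format(numerator, denominator)
    if fmt == 7:  # undefined: hexadecimal string (style is an assumption)
        return raw.hex()
    codes = {1: "B", 3: "H", 4: "L"}  # unsigned byte, short, long
    if fmt in codes:
        size = struct.calcsize(endian + codes[fmt])
        # Assumption: only the first component is decoded in this sketch.
        return struct.unpack(endian + codes[fmt], raw[:size])[0]
    return None  # a format we were not asked to parse


def add_entry(result, name, value):
    """Append value to the list stored under name (hypothetical helper)."""
    result.setdefault(name, []).append(value)


result = {}
add_entry(result, "Make", decode_value(2, b"Apple\x00", ">"))
add_entry(result, "XResolution", decode_value(5, struct.pack(">LL", 72, 1), ">"))
print(result)
```

Using setdefault() means a tag that appears in more than one IFD simply gets a longer list, with no special casing for the first occurrence.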

What to do

The above tasks should all be handled by a parse_exif() function which takes a single argument, a file-like object containing (only and exactly) one JFIF.

def parse_exif(f):
    # do it!

    # ...

    # Don't hardcode the answer! Return your computed dictionary.
    return {'Make':['Apple'], ...}

What to submit

Submit a single Python file named jpeg_exif.py. This file must define the three functions above, along with an ExifParseError, and should assume that tags.py is available for import. In other words, it should look like the following, but with actual implementations rather than placeholders for each function.

import tags

class ExifParseError(Exception):
    def __init__(self):
        pass


def carve(f, start, end):
    # return the bytes

    # here is an example that just returns the entire range of bytes:
    return f.read()


def find_jfif(f, max_length=None):
    # do some stuff

    # then return a possibly-empty sequence of pairs

    # here's an example that just returns the start and end of the file without parsing
    chunk = f.read()
    last_byte = len(chunk)
    return [(0, last_byte)]


def parse_exif(f):
    # do it!

    # ...

    # Don't hardcode the answer! Return your computed dictionary.
    return {'Make':['Apple']}

Tests

I’m not a total monster, so for this assignment I’m providing a subset of the Gradescope tests. Below is a set of Python unit tests and associated files. To use them, place them all in the same directory as your jpeg_exif.py file. Then, pick one to run and execute it at the command line:

> python3.5 test_carve.py 
....
----------------------------------------------------------------------
Ran 4 tests in 0.001s

OK

If the test passes, you’ll see OK as above. If not, you’ll get details about which test failed, and how the expected output (the first operand of the assertEqual method) differed from your code’s output (the second operand).

The tests:

test_carve.py: This file contains standalone tests for carve().
test_find_jfif.py: This file contains tests everyone’s code must pass for find_jfif(). It depends upon Designs.doc.
test_parse_exif.py: This file contains tests everyone’s code must pass for parse_exif(). It depends upon FullSizeRender.jpg and gore-superman.jpg, both of which contain Exif data in big-endian byte order.
test_find_jfif_parse.py: This file contains tests 590F students’ code must pass for find_jfif(). It depends upon Designs.doc and minimal.jpg.
test_parse_exif_little_endian.py: This file contains a test everyone’s code must pass for parse_exif(). It depends upon leaves.jpg, which contains Exif data in little-endian byte order.

The media:

Designs.doc: a MS-Word document containing carveable JFIFs as well as non-JFIF FF D8 and FF D9 values.
minimal.jpg: a minimal JFIF with no Exif.
FullSizeRender.jpg: a big-endian Exif-containing file with a single IFD.
gore-superman.jpg: a big-endian Exif-containing file with multiple IFDs.
leaves.jpg: a little-endian Exif-containing file with multiple IFDs.