05: JPEG and Exif

Estimated time to complete: four hours (or less, if you are experienced with Python and parsing binary file formats)

JPEG (really: JFIF) is a widely used lossy compression format for images. JPEG files often have interesting embedded metadata in the Exif format. In this assignment, you'll write a program that can carve JPEG files out of a larger document or disk image. Then, you'll write a program to parse (a subset of) the Exif metadata format. Carving and parsing are two common forensic tasks, and learning to do so on JPEG/Exif is more than just an academic exercise -- it also validates that the standards work the way they are documented.

Carving JPEGs

As we discussed in lecture, the JPEG File Interchange Format is the container format in which JPEG images are stored. This format is how ".jpg" files are stored on disk. JFIF has a well-defined file structure that is relatively straightforward to recognize and parse.

Many other document formats, such as Microsoft Word .doc and .docx files, or Adobe's Portable Document Format (.pdf), embed external media directly into their files; you can linearly scan through these documents to find JFIF data. These JPEGs might or might not be accessible in the document; they might live in "slack space" if the document format entails slack. Similarly, you can scan through disk images to extract some JPEGs (though not all, as we'll see later in the semester), and again, files that have been deleted might still be accessible as raw data.

To a first approximation, you can carve JFIF files in a straightforward, greedy way. In particular, you can:

  • First, linearly scan through a file (viewed as bytes) looking for and tracking the offsets of byte sequences that correspond to the "Start of Image" (FF D8) and "End of Image" (FF D9).
  • Then, for each (SOI, EOI) pair that "makes sense," extract the bytes from the beginning of the SOI bytes to the end of the EOI bytes. What does it mean for a pair to "make sense"?
    • The SOI should come before EOI in the file.
    • (optionally) The number of bytes being extracted should be less than or equal to some pre-specified cutoff.

Why is this an approximation? Because the SOI/EOI markers are just data: they might appear anywhere in a file, including a non-JPEG file. For a given SOI marker, there may be several sensical EOI markers that correspond to it.
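
To make the greedy approach concrete, here is a minimal sketch that works on an in-memory bytes object rather than a file. The helper names find_all() and candidate_pairs() are made up for illustration. Reporting the end offset as the last byte of the EOI marker is also an assumption; check the tests for the convention they expect.

```python
SOI = b"\xff\xd8"
EOI = b"\xff\xd9"


def find_all(data, marker):
    """Return the offset of every occurrence of marker in data."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + 1)
    return offsets


def candidate_pairs(data, max_length=None):
    """Pair each SOI with every later EOI, optionally capping the length."""
    sois = find_all(data, SOI)
    eois = find_all(data, EOI)
    pairs = []
    for start in sois:
        for eoi in eois:
            if eoi < start:
                continue                  # the SOI must come first
            end = eoi + 1                 # last byte of the EOI (assumed convention)
            length = end - start + 1
            if max_length is not None and length > max_length:
                continue
            pairs.append((start, end))
    return pairs


# One SOI followed by two EOIs: both pairings "make sense".
sample = b"junk" + SOI + b"abc" + EOI + b"xx" + EOI
print(candidate_pairs(sample))                 # [(4, 10), (4, 14)]
print(candidate_pairs(sample, max_length=7))   # [(4, 10)]
```

Note how a single SOI yields two candidates here; that is exactly the ambiguity described above.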

What to do

To start, implement a carve() function that, given a file-like object and start and end offsets (both inclusive), returns a bytes object containing the bytes between those offsets:

def carve(f, start, end):
    # return the bytes

    # here is an example that just returns the entire range of bytes:
    return f.read()

If you re-use the same file-like object, you will find that as you read() through it to the end, you can't re-use it without re-opening it. Or can't you? You can do the right thing, which is to understand that in the standard UNIXy abstraction for files, there is a way to adjust the current position within the file. This is called "seeking." The io module documentation on seek() explains the details, but in short, you can use f.seek(n) to seek to the nth byte of the file f, starting from the beginning of the file. Use f.seek(0) to jump back to the start of the file.

This is called seeking to an absolute position; you can also seek relative to the current position or to the end of the file. Again, read the docs if you want to do so.
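
Here is a small illustration of seeking, using io.BytesIO as a stand-in for a file opened in 'rb' mode (a real file object behaves the same way for these calls):

```python
import io

f = io.BytesIO(b"0123456789")

first = f.read()           # reads everything; the position is now at the end
empty = f.read()           # nothing left to read, so this is b""

f.seek(0)                  # absolute seek: back to the start of the file
again = f.read(4)          # b"0123"

f.seek(2, io.SEEK_CUR)     # relative seek: skip 2 bytes from the current position
middle = f.read(2)         # b"67"

f.seek(-3, io.SEEK_END)    # seek relative to the end of the file
tail = f.read()            # b"789"

position = f.tell()        # tell() reports the current position: 10
```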

(We won't actually use the carve() function to carve the bytes, though doing so and then writing them to a file is fairly straightforward. Open a file in wb mode and write() them.)
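
If you do want to save carved bytes, something like the following would work. The write_bytes() helper, the output file name, and the placeholder bytes are all made up for illustration.

```python
import os
import tempfile


def write_bytes(path, data):
    """Write a bytes object to path, replacing any existing file."""
    with open(path, "wb") as out:
        out.write(data)


# Stand-in for bytes returned by carve(); not a real image.
carved = b"\xff\xd8 pretend image data \xff\xd9"

out_path = os.path.join(tempfile.mkdtemp(), "carved_0.jpg")
write_bytes(out_path, carved)

# Read the file back to confirm the bytes survived the round trip.
with open(out_path, "rb") as check:
    round_trip = check.read()
```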

Next, implement a find_jfif() function which, when passed a binary file-like object (as returned by open(filename, 'rb')) and an optional maximum length, returns a sorted sequence of pairs. Each pair should hold the offsets into the file of a sensical SOI/EOI pair. Your function should look something like:

def find_jfif(f, max_length=None):
    # do some stuff

    # then return a possibly-empty sequence of pairs

    # here's an example that just returns the start and end of the file without parsing
    chunk = f.read()
    last_byte = len(chunk)
    return [(0, last_byte)]

Try to get this working before moving on.

590F students: As mentioned above, SOI/EOI markers are just data and might appear anywhere in a file. You can do a little more parsing to see if the data range is likely to actually be in JFIF format.

In particular, you can verify that the JFIF format is followed within the range of bytes. Each SOI should immediately be followed by one or more correctly-formatted segments, then a single SOS (Start of Scan) marker, image data, and an EOI. You don't need to parse each segment completely, but you should check each segment's length, and make sure that another segment (or the SOS and image data) starts immediately thereafter.
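
One way to sketch this segment walk is shown below. It assumes every segment before the SOS is an FF XX marker followed by a two-byte big-endian length that counts its own two bytes but not the marker. It ignores rarer cases such as fill bytes or markers that carry no length. The function name offset_of_sos() is made up for illustration.

```python
import struct

SOI = b"\xff\xd8"
SOS = 0xDA


def offset_of_sos(data, start):
    """Walk the segments after the SOI at start.

    Returns the offset of the SOS marker, or None if the bytes do not
    look like a chain of well-formed segments.
    """
    if data[start:start + 2] != SOI:
        return None
    pos = start + 2
    seen_segment = False
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None                   # every segment starts with FF
        marker = data[pos + 1]
        if marker == SOS:
            # require at least one segment between the SOI and the SOS
            return pos if seen_segment else None
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            return None                   # the length includes its own 2 bytes
        pos += 2 + length                 # jump to where the next marker should be
        seen_segment = True
    return None                           # ran off the end without finding an SOS
```

For example, a candidate whose SOI is followed by arbitrary document bytes fails the very first check, since the byte after the SOI is not FF.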

Update: Skip the check described in the next paragraph. There are exceptions to it (read the JPEG spec if you're curious) but they're more nitpicky than they are worth for this assignment. Just checking segment markers and sizes as described in the previous paragraph gets you 90% of the way there.

Within the image data only, FF bytes will always be followed by 00 bytes (see https://en.wikipedia.org/wiki/JPEG#Syntax_and_structure); if not, it's not a valid JPEG! The only FF XX bytes you will see when scanning the image data will be FF D9 -- the EOI marker.

590F: modify your find_jfif() function to take an additional boolean argument parse (defaulting to False) that determines if it performs this additional parsing of the segment markers or not, that is:

def find_jfif(f, max_length=None, parse=False):
    # ...

will return fewer, but more likely-to-be-valid, offsets to JFIFs if parse is set to True.

Parsing Exif

So now you've got a way to extract JFIF files. What can you do with them once you've got them? Find and parse the Exif data (if it exists), of course!

Generally, parsing Exif will follow the process we did manually in class. First, check if the file starts with an SOI marker. Then, read through the segments until you find the first Exif segment. Parse this segment:

  • 365: Confirm it's a big-endian Exif block. If not, you can signal an exception, as described below.
  • 590F: Your parser should handle both little- and big-endian Exif!
  • Find the start of the IFD.
  • Determine the number of entries.
  • For each entry:

    • Determine its name from its tag. We provide a file tags.py containing the list of names you must handle; if the tag is not in this file, skip the entire entry. You'll need to import this file to access its one defined variable.
    • If its type (format) is one of 1, 2, 3, 4, 5, or 7, then parse out its value(s), which may be stored in the data field, or at an offset pointed to by the data field. If the value is a single number, store it as such; if it is more than one number, store it as a list of numbers; if it is textual, store the parsed (UTF-8) text, with the trailing NUL (00) byte removed; if it is undefined (raw), store it as a string in hexadecimal format; if it is a rational number, store it as a string numerator/denominator with appropriate values rather than text.

      For some types, you need only parse and store the first value; for some you'll need to store several (like text). We provide a format table, which is a cheat sheet derived from the Exif specification about each possible format, how to use struct.unpack to decode its values, and whether we want you to get the first or all components.

  • If there is another IFD, repeat. (Usually the second IFD is for the thumbnail, if any, embedded in the JFIF.)
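
The walk through the IFDs can be sketched as follows. It assumes you have already located the Exif segment and sliced out the bytes that begin at the byte-order mark ("MM" or "II"), since offsets inside Exif are relative to that point. The function name read_ifds() and the tuple layout it returns are made up for illustration, and it does not guard against truncated data.

```python
import struct


class ExifParseError(Exception):
    pass


def read_ifds(tiff):
    """Return (bom, ifds); each IFD is a list of (tag, fmt, count, data)."""
    if tiff[:2] == b"MM":
        bom = ">"                         # big-endian
    elif tiff[:2] == b"II":
        bom = "<"                         # little-endian
    else:
        raise ExifParseError("unknown byte order")

    magic, ifd_offset = struct.unpack(bom + "HL", tiff[2:8])
    if magic != 42:
        raise ExifParseError("bad magic number")

    ifds = []
    seen = set()                          # guard against looping offsets
    while ifd_offset != 0 and ifd_offset not in seen:
        seen.add(ifd_offset)
        (n_entries,) = struct.unpack(bom + "H", tiff[ifd_offset:ifd_offset + 2])
        pos = ifd_offset + 2
        entries = []
        for _ in range(n_entries):
            # each entry is 12 bytes: tag, format, component count, data field
            tag, fmt, count = struct.unpack(bom + "HHL", tiff[pos:pos + 8])
            data = tiff[pos + 8:pos + 12]     # raw: a value or an offset
            entries.append((tag, fmt, count, data))
            pos += 12
        ifds.append(entries)
        # the four bytes after the last entry point to the next IFD (0 = none)
        (ifd_offset,) = struct.unpack(bom + "L", tiff[pos:pos + 4])
    return bom, ifds
```

Keeping the data field as raw bytes at this stage is a design choice: whether it holds the value itself or an offset depends on the entry's format and count, so it is easier to interpret it in a separate step.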

Store and return the parsed entries in a Python dictionary:

  • Each key in the dictionary should be a string, taken from the list of valid tags in tags.py, corresponding to an entry in the IFD(s). If the tag is not in tags.py, then there should be no entry in the dictionary.
  • If the type (format) of the field is in the list of formats we asked you to parse, the value associated with this key should be the parsed value: it will either be a number, a list of numbers, or a string. If the type is not in the list of formats we asked you to parse, set it to None (that is, the Python None type, not the string "None").
  • If during your parse you see a field more than once, then store only the last value for that field, overwriting the previously-stored value in the dictionary. (In practice you'd probably track them all separately, but we'll elide this complication here.)
  • If the file was unparseable, that is, if it was not a valid JFIF, or if there was no Exif data, or if you are in 365 and the Exif data is little-endian, then raise an ExifParseError. (See https://docs.python.org/3.5/tutorial/errors.html#user-defined-exceptions and the starter code.)
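
The storage rules above, together with the value formats described earlier, can be sketched like this. Several things here are assumptions: TAG_NAMES is a hypothetical stand-in for the table in tags.py, entries are (tag, format, count, data) tuples with data as the raw 4-byte field, only the first rational is kept, and the hex string is whatever bytes.hex() produces. Defer to the format table and the tests where they differ.

```python
import struct

# Bytes per component for the formats the assignment asks you to parse.
SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 7: 1}
# struct codes for the plain unsigned integer formats (byte, short, long).
INT_CODES = {1: "B", 3: "H", 4: "L"}
# Hypothetical stand-in for the tag table in tags.py.
TAG_NAMES = {0x010F: "Make", 0x0112: "Orientation"}


def decode_value(tiff, bom, fmt, count, data):
    """Decode one entry; bom is '>' or '<', data is the raw data field."""
    if fmt not in SIZES:
        return None                          # a format we were not asked to parse
    total = SIZES[fmt] * count
    if total <= 4:
        raw = data[:total]                   # the value sits in the data field
    else:
        (offset,) = struct.unpack(bom + "L", data)
        raw = tiff[offset:offset + total]    # the data field is an offset
    if fmt == 2:                             # text: drop one trailing NUL
        text = raw[:-1] if raw.endswith(b"\x00") else raw
        return text.decode("utf-8")
    if fmt == 7:                             # undefined: hex string
        return raw.hex()
    if fmt == 5:                             # rational: first value only (assumed)
        numerator, denominator = struct.unpack(bom + "LL", raw[:8])
        return "{}/{}".format(numerator, denominator)
    values = list(struct.unpack(bom + str(count) + INT_CODES[fmt], raw))
    return values[0] if count == 1 else values


def build_dict(tiff, bom, entries):
    """Map tag names to decoded values; a repeated tag overwrites the earlier one."""
    result = {}
    for tag, fmt, count, data in entries:
        if tag in TAG_NAMES:                 # unknown tags are skipped entirely
            result[TAG_NAMES[tag]] = decode_value(tiff, bom, fmt, count, data)
    return result


tiff = b"\x00" * 8 + b"Apple\x00"            # toy block with text at offset 8
entries = [
    (0x010F, 2, 6, struct.pack(">L", 8)),    # Make: text stored at offset 8
    (0x0112, 3, 1, b"\x00\x01\x00\x00"),     # Orientation: inline short, 1
    (0x0112, 3, 1, b"\x00\x06\x00\x00"),     # Orientation again: 6 wins
    (0xBEEF, 3, 1, b"\x00\x01\x00\x00"),     # tag not in the table: skipped
]
parsed = build_dict(tiff, ">", entries)      # {'Make': 'Apple', 'Orientation': 6}
```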

What to do

The above tasks should all be handled by a parse_exif() function which takes a single argument, a file-like object containing (only and exactly) one JFIF.

def parse_exif(f):
    # do it!

    # ...

    return {'Make':'Apple', ...}

What to submit

Submit a single Python file named jpeg_exif.py. This file must define the three functions above, along with an ExifParseError, and should assume that tags.py is available for import. In other words, it should look like the following, but with actual implementations rather than placeholders for each function.

import tags

class ExifParseError(Exception):
    def __init__(self):
        pass


def carve(f, start, end):
    # return the bytes

    # here is an example that just returns the entire range of bytes:
    return f.read()


def find_jfif(f, max_length=None):
    # do some stuff

    # then return a possibly-empty sequence of pairs

    # here's an example that just returns the start and end of the file without parsing
    chunk = f.read()
    last_byte = len(chunk)
    return [(0, last_byte)]


def parse_exif(f):
    # do it!

    # ...

    return {'Make':'Apple'}

Tests

Below is a set of Python unit tests and associated files. To use them, place them all in the same directory as your jpeg_exif.py file. Then, pick one to run and execute it at the command line:

> python3.5 test_carve.py 
....
----------------------------------------------------------------------
Ran 4 tests in 0.001s

OK

If the test passes, you'll see OK as above. If not, you'll get details about which test failed, and how the expected output (the first operand of the assertEqual method) differed from your code's output (the second operand).

The tests:

  • test_carve.py: This file contains standalone tests for carve().
  • test_find_jfif.py: This file contains the tests everyone's code must pass for find_jfif(). It depends upon Designs.doc.
  • test_parse_exif.py: This file contains the tests everyone's code must pass for parse_exif(). It depends upon FullSizeRender.jpg and gore-superman.jpg.
  • test_find_jfif_parse.py: This file contains the tests 590F students' code must pass for find_jfif(). It depends upon Designs.doc and minimal.jpg.
  • test_parse_exif_little_endian.py: This file contains the test 590F students' code must pass for parse_exif(). It depends upon leaves.jpg.

The media:

  • Designs.doc: a MS-Word document containing carveable JFIFs as well as non-JFIF FF D8 and FF D9 values.
  • minimal.jpg: a minimal JFIF with no Exif.
  • FullSizeRender.jpg: a big-endian Exif-containing file with a single IFD.
  • gore-superman.jpg: a big-endian Exif-containing file with multiple IFDs.
  • leaves.jpg: a little-endian Exif-containing file with multiple IFDs.