06: JPEG and Exif

Estimated time to complete: four hours (or less, if you are experienced with Python and parsing binary file formats)

JPEG (really: JFIF) is a widely-used lossy compression format for images. JPEG files often have interesting embedded metadata in the Exif format. In this assignment, you’ll write a program that can carve JPEG files out of a larger document or disk image. Then, you’ll write a program to parse (a subset of) the Exif metadata format. Carving and parsing are two common forensic tasks, and learning to do so on JPEG/Exif is more than just an academic exercise – it also validates that the standards work the way they are documented.

Carving JPEGs

As we discussed in lecture, the JPEG File Interchange Format is the container format in which JPEG images are stored. This format is how “.jpg” files are stored on disk. JFIF has a well-defined file structure that is relatively straightforward to recognize and parse.

Many other document formats, such as Microsoft Word .doc and .docx files, or Adobe’s Portable Document Format (.pdf), embed external media directly into their files; you can linearly scan through these documents to find JFIF data. These JPEGs might or might not be accessible in the document; they might live in “slack space” if the document format entails slack. Similarly, you can scan through disk images to extract some JPEGs (though not all, as we’ll see later in the semester), and again, files that have been deleted might still be recoverable from the raw data.

To a first approximation, you can carve JFIF files in a straightforward, greedy way. In particular, you can:

  • First, linearly scan through a file (viewed as bytes) looking for and tracking the offsets of byte sequences that correspond to the “Start of Image” (FF D8) and “End of Image” (FF D9).
  • Then, for each (SOI, EOI) pair that “makes sense,” extract the bytes from the beginning of the SOI bytes to the end of the EOI bytes. What does it mean for a pair to “make sense”?
    • The SOI should come before EOI in the file.
    • (optionally) The number of bytes being extracted should be less than or equal to some pre-specified cutoff.

Why is this an approximation? Because the SOI/EOI markers are just data: they might appear anywhere in a file, including a non-JPEG file. For a given SOI marker, there may be several sensical EOI markers that correspond to it.
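
As a rough illustration of the scanning and pairing steps (a sketch, not the required implementation), the snippet below collects marker offsets from an in-memory bytes object and pairs them up. The helper name marker_offsets and the sample bytes are made up for this example. Each pair here holds the offset of the first byte of each marker, which is only one possible convention; the functions described below must follow the conventions the assignment specifies.

```python
SOI = b"\xff\xd8"  # Start of Image marker
EOI = b"\xff\xd9"  # End of Image marker


def marker_offsets(data, marker):
    """Return every offset at which `marker` occurs in `data` (hypothetical helper)."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + 1)
    return offsets


# Made-up sample: one SOI followed by two EOIs, then a trailing SOI with no EOI.
sample = b"junk" + SOI + b"abc" + EOI + b"more" + EOI + SOI

sois = marker_offsets(sample, SOI)
eois = marker_offsets(sample, EOI)

# A pair "makes sense" here if the SOI comes before the EOI.
candidates = [(s, e) for s in sois for e in eois if s < e]
print(candidates)
```

Note that the single SOI at offset 4 pairs with both EOIs, which is exactly the ambiguity described above, and the trailing SOI pairs with nothing.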

What to do

To start, implement a carve() function that, given a file-like object and a start (inclusive) and end (exclusive) offset, returns a bytes sequence containing the bytes between those offsets:

def carve(f, start, end):
    # return the bytes

    # here is an example that just returns the entire range of bytes, 
    # in other words, it does not do the right thing
    return f.read()

If you re-use the same file-like object, you will find that as you read() through it to the end, you can’t re-use it without re-opening it. Or can’t you? You can do the right thing, which is to understand that in the standard UNIXy abstraction for files, there is a way to adjust the current position within the file. This is called “seeking.” The io module documentation on seek() explains the details, but in short, you can use f.seek(n) to seek to the nth byte of the file f, starting from the beginning of the file. Use f.seek(0) to jump back to the start of the file.

This is called seeking to an absolute position; you can also seek relative to the current position or the end of the file; again, read the docs if you want to do so.
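
Here is a small demonstration of seeking, using an in-memory io.BytesIO object as a stand-in for a file opened in 'rb' mode. The sample bytes are made up; the seek() calls and the io.SEEK_CUR / io.SEEK_END constants come from the standard io module.

```python
import io

f = io.BytesIO(b"0123456789")   # stand-in for open(filename, 'rb')

everything = f.read()           # reads to the end of the "file"
nothing = f.read()              # position is at the end, so this is empty

f.seek(0)                       # absolute: back to the start
first_four = f.read(4)

f.seek(2, io.SEEK_CUR)          # relative: skip 2 bytes from the current position
middle = f.read(2)

f.seek(-3, io.SEEK_END)         # relative to the end: last 3 bytes
tail = f.read()

print(first_four, middle, tail, f.tell())
```

f.tell() reports the current position, which is handy when debugging offset arithmetic.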

(We won’t actually use the carve() function to carve the bytes, though doing so and then writing them to a file is fairly straightforward. Open a file in wb mode and write() them.)
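
If you do want to save carved bytes, a minimal sketch looks like the following. The bytes and the output filename are placeholders invented for the example, and the file is written into a temporary directory so that nothing in your working directory is clobbered.

```python
import os
import tempfile

# Placeholder bytes standing in for whatever carve() returned.
carved = b"\xff\xd8example\xff\xd9"

# Hypothetical output location; pick any path you like.
out_path = os.path.join(tempfile.mkdtemp(), "carved_0.jpg")

with open(out_path, "wb") as out:    # 'wb': write, binary mode
    out.write(carved)

# Read it back to confirm the bytes survived the round trip.
with open(out_path, "rb") as check:
    round_trip = check.read()

print(round_trip == carved)
```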

Next, implement a find_jfif() function which, when passed a binary file-like object (as returned by open(filename, 'rb')) and an optional maximum length, returns a sorted sequence of pairs. Each pair should hold two offsets into the file, marking the start and end of a sensical SOI/EOI pair. Your function should look something like:

def find_jfif(f, max_length=None):
    # do some stuff

    # then return a possibly-empty sequence of pairs

    # here's an example that just returns a sequence consisting of only
    # one pair: the start and end of the file 
    # it doesn't do  any parsing
    chunk = f.read()
    last_byte = len(chunk)
    return [(0, last_byte)]

NOTE: Due to an error when adjusting this assignment from last year, the start/end points (for calculating length) are both inclusive, unlike the arguments to carve(), which follow the Python standard of start is inclusive, end is exclusive. Sorry ‘bout that, but I don’t want to change it now as people have already submitted answers under the currently-expected behavior.
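
To make the two conventions concrete, here is a sketch under one reading of the note: the inclusive end offset is taken to be the offset of the final byte (D9) of the EOI marker. That reading is an assumption, as are the sample bytes; the provided tests are the authority on what find_jfif() must actually return.

```python
import io

data = b"xx\xff\xd8payload\xff\xd9yy"        # made-up sample

start = data.find(b"\xff\xd8")               # offset of the first SOI byte
end_inclusive = data.find(b"\xff\xd9") + 1   # offset of the last EOI byte (assumed convention)

# With both endpoints inclusive, the length needs a "+ 1".
length = end_inclusive - start + 1

# A carve()-style read uses an exclusive end, so it is end_inclusive + 1.
f = io.BytesIO(data)
f.seek(start)
jfif = f.read((end_inclusive + 1) - start)

print((start, end_inclusive), length, jfif)
```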

Parsing Exif

So now you’ve got a way to extract JFIF files. What can you do with them once you’ve got them? Find and parse the Exif data (if it exists), of course!

Generally, parsing Exif will follow the process we did manually in class. First, check if the file starts with an SOI marker. Then, read through the segments until you find the first Exif segment. Parse this segment:

  • Determine whether it’s big-endian or little-endian; handle multibyte data after this point appropriately.
  • Find the start of the IFD.
  • Determine the number of entries.
  • For each entry:

    • Determine its name from its tag. We provide a file tags.py containing the list of names you must handle; if the tag is not in this file, skip the entire entry. You’ll need to import this file to access its one defined variable.
    • If its type (format) is one of 1, 2, 3, 4, 5, or 7, then parse out its value(s), which may be stored in the data field, or at an offset pointed to by the data field. If the value is a single number, store it as such; if it is more than one number, store it as a list of numbers; if it is textual, store the parsed text (that is, as a native Python string – strip the trailing NUL (00) byte and decode it as UTF-8); if it is undefined (raw), store it as a string in hexadecimal format; if it is a rational number, store it as a string of the form numerator/denominator, with the actual numeric values filled in.

      For some types, you need only parse and store the first value; for some you’ll need to store several (like text). We provide a format table, which is a cheat sheet derived from the Exif specification about each possible format, how to use struct.unpack to decode its values, and whether we want you to get the first or all components. Be careful if you copy/paste code out of this table – depending upon your editor, it might paste in curly quotes (‘ ’ “ ”) instead of straight quotes (' " – different codepoints!), which will choke your Python interpreter.

  • If there is another IFD, repeat. (Usually the second IFD is for the thumbnail, if any, embedded in the JFIF.)
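
The structural steps above can be sketched on a hand-built example. This assumes the standard TIFF layout that Exif uses: an 8-byte header (a byte-order mark of "MM" for big-endian or "II" for little-endian, the number 42, and the offset of the first IFD), and IFDs made of a 2-byte entry count, 12-byte entries (tag, format, component count, 4-byte data field), and a 4-byte offset to the next IFD (0 if there is none). Offsets are relative to the start of the header. The helper name read_ifd_entries and the sample bytes are invented for illustration, and the sketch skips finding the Exif segment within the JFIF and looking up names in tags.py.

```python
import struct


def read_ifd_entries(tiff, ifd_offset, bom):
    """Return (entries, next_ifd_offset) for the IFD at ifd_offset.

    tiff starts at the byte-order mark; bom is the struct prefix (">" or "<").
    Each entry is (tag, format, components, data), where data is the raw
    4-byte field holding either the value itself or an offset to it.
    """
    (count,) = struct.unpack(bom + "H", tiff[ifd_offset:ifd_offset + 2])
    entries = []
    for i in range(count):
        base = ifd_offset + 2 + 12 * i
        tag, fmt, components = struct.unpack(bom + "HHL", tiff[base:base + 8])
        entries.append((tag, fmt, components, tiff[base + 8:base + 12]))
    tail = ifd_offset + 2 + 12 * count
    (next_ifd,) = struct.unpack(bom + "L", tiff[tail:tail + 4])
    return entries, next_ifd


# Hand-built big-endian sample: header, then one IFD with a single entry.
sample = (
    b"MM" + struct.pack(">HL", 42, 8)       # byte order, 42, first IFD at offset 8
    + struct.pack(">H", 1)                  # this IFD has one entry
    + struct.pack(">HHL", 0x0112, 3, 1)     # tag 0x0112, format 3 (short), 1 component
    + struct.pack(">H", 6) + b"\x00\x00"    # value 6, left-justified in the data field
    + struct.pack(">L", 0)                  # 0 means no further IFD
)

bom = {b"MM": ">", b"II": "<"}[sample[:2]]
magic, first_ifd = struct.unpack(bom + "HL", sample[2:8])
entries, next_ifd = read_ifd_entries(sample, first_ifd, bom)

tag, fmt, components, data = entries[0]
(value,) = struct.unpack(bom + "H", data[:2])   # a short occupies the first 2 bytes
print(hex(tag), fmt, components, value, next_ifd)
```

Passing the struct prefix around as bom keeps every later unpack call consistent with the file's byte order, which is the easiest thing to get wrong when handling both kinds of file.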

Store and return the parsed entries in a Python dictionary:

  • Each key in the dictionary should be a string, taken from the list of valid tags in tags.py, corresponding to an entry in the IFD(s). If the tag is not in tags.py, then there should be no entry in the dictionary.
  • Each value in the dictionary is a list. Each time you see an IFD tag, you should append the corresponding parsed value to the list in the dictionary. (You can look at the tests below to see the expected format.)
  • If the type (format) of the field is in the list of formats we asked you to parse, the value associated with this key should be the parsed value: it will either be a number, a list of numbers, or a string. If the type is not in the list of formats we asked you to parse, set it to None (that is, the Python None value, not the string "None").
  • If the file was unparseable, that is, if it was not a valid JFIF, or if there was no Exif data, then raise an ExifParseError. (See https://docs.python.org/3.5/tutorial/errors.html#user-defined-exceptions and the starter code.)
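
The value conventions and the dictionary shape can be sketched as follows. All helper names here are hypothetical, the tag names are used purely as examples (check tags.py for the names you must handle), and the exact hexadecimal formatting expected for undefined data is an assumption; compare against the provided tests. The ExifParseError class is a stand-in so the snippet runs on its own.

```python
import struct


class ExifParseError(Exception):
    """Stand-in for the exception class from the starter code."""


def format_ascii(raw):
    # Drop the single trailing NUL terminator, if present, then decode.
    if raw.endswith(b"\x00"):
        raw = raw[:-1]
    return raw.decode("utf-8")


def format_rational(raw, bom):
    # Two unsigned 4-byte integers: numerator, then denominator.
    numerator, denominator = struct.unpack(bom + "LL", raw[:8])
    return "{}/{}".format(numerator, denominator)


def format_undefined(raw):
    # Assumed hex style: lowercase, no separators.
    return raw.hex()


def record(results, name, value):
    # Every key maps to a list; repeated tags append to the same list.
    results.setdefault(name, []).append(value)


def require_soi(data):
    # One example of an "unparseable" check: the data must start with SOI.
    if data[:2] != b"\xff\xd8":
        raise ExifParseError("data does not start with an SOI marker")


results = {}
record(results, "Make", format_ascii(b"Apple\x00"))
record(results, "XResolution", format_rational(struct.pack(">LL", 72, 1), ">"))
record(results, "ExifVersion", format_undefined(b"0221"))
record(results, "XResolution", format_rational(struct.pack(">LL", 300, 1), ">"))
print(results)
```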

What to do

The above tasks should all be handled by a parse_exif() function which takes a single argument, a file-like object containing (only and exactly) one JFIF.

def parse_exif(f):
    # do it!

    # ...

    # Don't hardcode the answer! Return your computed dictionary.
    return {'Make':['Apple'], ...}

Notably, it does not interact with your file carver from the first part of the assignment in any way. (A real system might, but here we’re keeping them independent.)

What to submit

Submit a single Python file named jpeg_exif.py. This file must define the three functions above, along with an ExifParseError, and should assume that tags.py is available for import. In other words, it should look like the following, but with actual implementations rather than the placeholder pass for each function.

import struct

import tags


class ExifParseError(Exception):
    """
    An exception representing a failure to parse a JPEG for a valid Exif tag.
    """
    def __init__(self):
        pass # note: don't have to do anything else for this method


def carve(f, start, end):
    """
    Carve and return the bytes of the file-like object f from offsets start (inclusive)
    to end (exclusive).

    In other words, the endpoint behavior parallels the Python slicing convention.
    """
    pass


def find_jfif(f, max_length=None):
    """
    Return the offsets ((start, end) inclusive pairs) of JFIF-seeming data within f


    :param f: a file-like object
    :param max_length: the maximum size of the interval (start, end) to consider
    :return: a list of offsets
    """
    pass


def parse_exif(f):
    """
    Return the Exif data from the JFIF file stored in the file-like object f.
    """
    pass

Tests

I’m not a total monster, so for this assignment I’m providing a subset of the Gradescope tests. Below is a set of Python unit tests and associated files. To use them, place them all in the same directory as your jpeg_exif.py file. Then, pick one to run and execute it at the command line, for example:

> python3.6 test_carve.py 
....
----------------------------------------------------------------------
Ran 4 tests in 0.001s

OK

If the test passes, you’ll see OK as above. If not, you’ll get details about which test failed, and how the expected output (the first operand of the assertEqual method) differed from your code’s output (the second operand).

The tests:

  • test_carve.py: This file contains standalone tests for carve().
  • test_find_jfif.py: This file contains tests your code must pass for find_jfif(). It depends upon Designs.doc.
  • test_parse_exif.py: This file contains tests your code must pass for parse_exif(). It depends upon FullSizeRender.jpg and gore-superman.jpg, both of which contain Exif data in big-endian byte order.
  • test_parse_exif_little_endian.py: This file contains a test your code must pass for parse_exif(). It depends upon leaves.jpg, which contains Exif data in little-endian byte order.

The media:

  • Designs.doc: an MS Word document containing carveable JFIFs as well as non-JFIF FF D8 and FF D9 values.
  • minimal.jpg: a minimal JFIF with no Exif, in case you want such a thing.
  • FullSizeRender.jpg: a big-endian Exif-containing file with a single IFD.
  • gore-superman.jpg: a big-endian Exif-containing file with multiple IFDs.
  • leaves.jpg: a little-endian Exif-containing file with multiple IFDs.