COMPSCI 590K: Advanced Digital Forensics Systems | Spring 2020

Assignment 02: Hashing and recovery

This assignment is due by 9pm on Tuesday, March 3rd (extended from Thursday, February 27th). It must be submitted through Moodle.

  1. (10 points) Suppose you were going to use a Bloom filter to represent the data stored on a 10 TB drive, using the small-block method proposed by Garfinkel et al. Your intention is to be able to keep this Bloom filter in the memory of a reasonably-provisioned workstation for use in other forensics tasks. Describe how you’d parameterize the filter, including:

    a. What size blocks would you choose?
    b. How large is your Bloom filter?
    c. What is the value of k, that is, the number of hash functions you would use?
    d. What is the expected false positive rate (FPR) of your filter?

    (This problem is deliberately underconstrained: There is not a single “correct” answer, though some choices are obviously not reasonable.)

    Briefly justify your answers for each of the above. For example, one poor answer to (a.) might read “I chose 1 TB blocks, because 1 TB is small compared to 10 TB, and then there will only be 10 entries in the Bloom filter.”

    Now, suppose instead that you sampled the drive before you loaded it into your filter, and discovered that it was mostly empty. In particular, only one out of every 10,000 sectors on the drive contained something other than entirely NUL (0x00) data. How would your answer to the above change? Again, briefly justify your answer.
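    When weighing candidate parameterizations, the standard approximation p ≈ (1 − e^(−kn/m))^k for the false positive rate is handy. The sketch below shows how to evaluate it; the block size and filter size used here are illustrative only, not a suggested answer:

```python
import math

def bloom_fpr(m_bits: int, n_items: int, k: int) -> float:
    """Standard approximation of a Bloom filter's false-positive rate."""
    return (1.0 - math.exp(-k * n_items / m_bits)) ** k

def optimal_k(m_bits: int, n_items: int) -> float:
    """The k that minimizes the FPR for a given m and n."""
    return (m_bits / n_items) * math.log(2)

# Illustrative numbers only -- not an answer to (a)-(d):
n = (10 * 10**12) // 4096   # one entry per 4 KiB block of a 10 TB drive
m = 4 * 2**30 * 8           # a 4 GiB filter, measured in bits
print(optimal_k(m, n))      # roughly 10 hash functions
print(bloom_fpr(m, n, 8))
```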

  2. (20 points) Garfinkel et al. propose three types of file discriminators in “Using Purpose-Built Functions…”: one for JPEGs (specifically, for the image data portion of JPEGs – the headers of JPEGs can be recognized by attempting to parse them), one for MP3s, and one for Huffman-coded data. Choose one of the three and implement it. In other words, write a program that takes as input a file representing a block, and produces a positive output if the input exceeds the threshold required.

    Which should you choose? Up to you. Arguably JPEG is the easiest and general Huffman the hardest to implement, but it’s not a huge difference.

    What parameters should you choose? Use the ones presented in the paper (for example, for JPEGs, use HE=220 and LE=2).
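    For concreteness, here is one possible skeleton of such a threshold test. Note that the entropy proxy below (counting distinct byte values in the block) is an assumption made for illustration; substitute the exact statistic the paper defines for whichever discriminator you pick:

```python
HE = 220  # high-entropy threshold (tuning value from the paper)
LE = 2    # minimum count of FF 00 byte-stuffing escapes

def looks_like_jpeg_data(block: bytes) -> bool:
    """Return True if the block passes both threshold tests.

    Assumption: the entropy proxy used here is the number of distinct
    byte values in the block; replace it with the paper's exact metric."""
    distinct_values = len(set(block))
    ff00_escapes = block.count(b"\xff\x00")
    return distinct_values >= HE and ff00_escapes >= LE
```

    You would invoke this on each 4 KiB block read from the input file, e.g. `looks_like_jpeg_data(open(path, "rb").read(4096))`.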

    Then, test your program on a small data set you create, consisting of at least five examples of true positives and five examples of true negatives. Each file should be at least 4KB long. Try to choose meaningful true negatives as per the paper – don’t just create five empty files (or files full of 0x00)!

    Include in your submission your program, the test files you used (or if they are large, links to them), the resulting confusion matrix (that is, a table of TP/FP/TN/FN), and a brief note about anything unusual you encountered while working on this question.

  3. (15 points) Suppose you were able to find the tail end of a JPEG by carving blocks and discriminating them as JPEG (maybe using the discriminator from the previous example), but that you were not able to recover the headers of that file. Fortunately, you were also able to carve the first 8 KB of another JPEG created by the same device. Using the method proposed in Sencar and Memon, reconstruct as much of the JPEG data from the tail end as possible.


    Include in your submission a cleaned-up rendering of the recovered JPEG’s contents in a lossless format such as PNG (do not submit a manually created JPEG that may or may not decode!). Also include a brief writeup of what you did to recover the image. You do not need to write a program to do this (though you can); a one-off reconstruction using a hex editor is fine.

    (For fun, run the 8KB header fragment through exiftool to see when/where it was created. Think about this every time you upload a photo somewhere.)

  4. (40 points) The DEFLATE reconstruction algorithm described by Brown has several steps: finding Huffman-coded data, finding distinct packets, decompressing those packets, reversing the redundancy removal, and then (essentially) guessing the unknown characters by building a model and fitting to it. I will spare you the pain of finding the Huffman-coded data and decompressing it: it’s not “hard” per se, but the actual details of the DEFLATE on-disk format can be tricky if you haven’t done much programming in this domain. And since data science / ML / AI are The Hotness these days, I’ll assume you already know how to do the last step of modeling the underlying text (or could work it out). That leaves reversing the redundancy removal.

    Let’s simplify the problem by only considering input consisting of printable ASCII. Suppose the compressed (redundancy-removed) data is written out in an expanded, text-based format. (In practice you would binary code this compressed data, but again, I’ll spare you having to write a binary parser.)

    The compressed stream is represented as a sequence of items, each enclosed in angle brackets; each item is separated from its successor by a single space. Non-redundant bytes are represented by an item of the form <0, c>, that is, a zero, a comma, a space, and a character c, which may be any printable ASCII character. These represent a literal character in the uncompressed input. Note that newlines (that is, \ns) are not escaped – they show up as a newline in this format!

    Redundant bytes, that is, back-pointers to byte(s) that exist earlier in the stream, are of the form <1, offset, length>, where the offset is the distance back in the uncompressed input stream to find the match, and the length is the length of the match.

    For example, a compressed form of the text “hello, world” would be represented as:

    <0, h> <0, e> <0, l> <0, l> <0, o> <0, ,> <0,  > <0, w> <0, o> <0, r> <0, l> <0, d>

    (There is not enough redundancy here to actually compress anything.)

    A compressed form of the text “a man, a plan, a canal: panama” would be:

    <0, a> <0,  > <0, m> <0, a> <0, n> <0, ,> <0,  > <1, 7, 2> 
    <0, p> <0, l> <1, 8, 6> <0, c> <1, 15, 2> <0, a> <0, l> 
    <0, :> <1, 15, 2> <1, 7, 3> <1, 26, 2>

    Obviously expanding things out into this easily-parseable format is not ideal if you actually care about compression :)
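    To make the format concrete, here is a compact reference decoder you can use to sanity-check the examples above (a sketch: it assumes well-formed input and does no error handling):

```python
def decompress(stream: str) -> str:
    """Expand the angle-bracket compressed format back into text."""
    out = []
    i = 0
    while i < len(stream):
        if stream[i] != '<':                 # skip separators between items
            i += 1
            continue
        if stream.startswith('<0, ', i):     # literal: <0, c>
            out.append(stream[i + 4])
            i += 6                           # skip past "<0, c>"
        else:                                # back-pointer: <1, offset, length>
            j = stream.index('>', i)
            _, off, length = stream[i + 1:j].split(',')
            off, length = int(off), int(length)
            for _ in range(length):          # byte-at-a-time copy, so
                out.append(out[-off])        # overlapping matches work
            i = j + 1
    return ''.join(out)
```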

    (easy) Write a decompressor that takes as input compressed (redundancy-removed) data as above, and expands it.

    Here is the text of the Gettysburg Address, in original and compressed form, to use in your testing: [gettysburg.txt] [gettysburg-lz.txt]

    (slightly less easy) As above, but also handle compressed data that starts mid-stream. You can assume the data starts between < > items. Replace unknowable characters of the decompressed data with '?' characters. You can simulate this by trimming the front off of the provided sample data.
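    For the mid-stream variant, one approach (a self-contained sketch, with the same format assumptions as the examples above) is to emit '?' whenever a back-pointer reaches past the start of the fragment; the '?'s then propagate naturally through later copies:

```python
def decompress_midstream(stream: str) -> str:
    """Expand the angle-bracket format when the stream starts mid-file.

    Back-pointers that reach before the start of the fragment refer to
    unknowable bytes, which are emitted as '?'."""
    out = []
    i = 0
    while i < len(stream):
        if stream[i] != '<':                 # skip separators between items
            i += 1
            continue
        if stream.startswith('<0, ', i):     # literal: <0, c>
            out.append(stream[i + 4])
            i += 6
        else:                                # back-pointer: <1, offset, length>
            j = stream.index('>', i)
            _, off, length = stream[i + 1:j].split(',')
            off, length = int(off), int(length)
            for _ in range(length):
                idx = len(out) - off
                # idx < 0 means the match predates our fragment: unknowable.
                out.append(out[idx] if idx >= 0 else '?')
            i = j + 1
    return ''.join(out)
```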

    (hard) Simplified problems are for other people, you want the real thing. OK! Suppose you’re given a truncated file compressed with the DEFLATE algorithm (you can use gzip to make your own). Reproduce as much of the algorithm from Brown’s paper as you can – your goal is to arrive at (at least) something like Fig. 4 in the paper.

    For this actually hard part, you are free to use an existing Huffman decoder (no need to write it yourself), but you’ll need to modify it to help you find individual packets within the file. Then, you’ll need to decompress each packet, and reverse the redundancy removal, which is binary-encoded instead of in the nicely-parseable format above. You are also welcome to start from an existing LZ77 implementation here, but it will require significant modification to work in the face of missing data. Some of the optional reading covers the binary specification of the DEFLATE format in sufficient detail to accomplish this, with or without use of an existing implementation.

    Include in your submission your program along with a brief explanation of how to invoke it if it’s not completely obvious.

  5. (20 points) ssdeep and sdhash are two tools for similarity detection. Create a corpus consisting of the following items (feel free to use publicly-available documents):

    • a JPEG file, at least 100 KB in size (let’s call this image)
    • a text file, at least 10 KB (let’s call this textA)
    • another text file, at least 10 KB, unrelated to textA (let’s call this textB)
    • a version of image, at least doubled in size; the extra size is due to randomly-generated appended data
    • a version of image offset by a single bit (set or unset, doesn’t matter), and padded at the end with seven more zero bits – it should only be one byte longer as a result
    • a container file (PDF or Word Doc, perhaps) that contains textA
    • a container file (PDF or Word Doc, perhaps) that contains textA and image
    • a container file (PDF or Word Doc, perhaps) that contains textB and image
    • a binary file at least 5MB in size, consisting almost entirely of NUL (0x00) bytes; but, the file should also contain textA
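    The bit-offset variant in the list above can be produced with a few lines of Python (a sketch assuming the prepended bit is unset; any equivalent tool works):

```python
def shift_right_one_bit(data: bytes) -> bytes:
    """Prepend a single 0 bit, shifting every byte right by one.

    The seven zero bits of end padding fall out naturally, so the
    result is exactly one byte longer than the input."""
    out = bytearray()
    carry = 0
    for b in data:
        out.append((carry << 7) | (b >> 1))  # carry bit + top 7 bits of b
        carry = b & 1                        # low bit of b carries forward
    out.append(carry << 7)                   # final carry + 7 zero bits
    return bytes(out)
```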

    Suppose image, textA, and textB are “files of interest.” How similar to each of these do ssdeep and sdhash think the other files are? And, how might you interpret these numbers? Explain on the basis of your understanding of the algorithms, or the program documentation, or some other reasonable justification.

    Include your corpus in your submission.

Your submission should consist of your written answers, programs, and other required files. Putting it all into a reasonable archive format (.zip, .tar.gz) and uploading it through Moodle is how we expect you to get it to us.

Reminder: Group work is permitted (so long as you clearly indicate group members). But if you work in groups, we will generally expect a higher level of performance on the work.

It is also fine to collaborate in the construction of test data and corpora for this assignment, though I would prefer that the entire class not all use the same corpora.