Today’s paper is about recovering files compressed with the DEFLATE algorithm. As in last class’s paper, this scenario can arise when recovering deleted files, or when carving files from disk, or when files are damaged – all possibilities in the forensics context.
The same basic problem presents: important information is in the header of the file. This information controls a transformation (in this case, compression) of the file. Without it, it’s not immediately clear how to recover the contents of the file. And, unlike JPEG’s restart markers, it’s also not immediately clear how to re-synchronize – that is, find our position in – the partial DEFLATE stream.
To understand the method proposed by the paper, we need to understand how DEFLATE works. So just like last class, we’re gonna talk about it in some detail.
DEFLATE is a stream-based format. The input is arbitrary data (which you can think of as a sequence of bytes), and the output is a sequence of specially-formatted blocks of data, sometimes called “packets” (though they don’t have anything to do with networks per se). The block sizes are arbitrary, except that non-compressible (stored) blocks are limited to 65,535 bytes (more on this later). But in practice, blocks typically represent no more than 100–200KB of the original input stream (each block uses a single Huffman code, and often after about that much input, a new Huffman code works better than continuing the old one).
So, what actually happens? Two things. First, something called “redundancy removal,” aka “duplicate string removal.” Then Huffman coding.
Last class we talked about Run Length Encoding. Redundancy removal in DEFLATE is like that, but more powerful (that is, RLE is a strict subset of duplicate string removal).
The idea behind redundancy removal is that if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead of the literal repeat. An encoded match to an earlier string consists of an 8-bit length (3–258 bytes) and a 15-bit distance (1–32,768 bytes) to the beginning of the duplicate. Relative back-references can be made across any number of blocks, as long as the distance appears within the last 32 KB of uncompressed data decoded (since that’s the maximum distance).
(example from paper)
How does this subsume RLE? If the distance is less than the length, the duplicate overlaps itself, indicating repetition. For example, a run of 10 identical bytes can be encoded as one byte, followed by a duplicate of length 9, beginning with the previous byte.
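A minimal sketch of the overlapping-copy trick (the names here are mine, not the paper’s):

```python
def copy_match(out: bytearray, distance: int, length: int) -> None:
    """Resolve one back-reference by copying byte-by-byte.

    Copying one byte at a time is what makes overlapping matches
    (distance < length) legal: freshly copied bytes immediately
    become sources for later copies.
    """
    start = len(out) - distance
    for i in range(length):
        out.append(out[start + i])

# A run of 10 identical bytes: one literal 'a', then a match of
# length 9 at distance 1 -- the match overlaps itself, which is RLE.
buf = bytearray(b"a")
copy_match(buf, distance=1, length=9)
print(buf)  # bytearray(b'aaaaaaaaaa')
```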
There are many ways to tune the discovery and removal of redundant strings; the Lempel-Ziv algorithm (LZ77) is typically used, but the exact details don’t matter for the purpose of understanding this paper.
So, what’s interesting here for our purposes is that the references are back references. Which means they could point back into parts of the file that we might not have (since we’re dealing with a file fragment). So that’s one immediate source of problems for recovering DEFLATEd files. But wait, there’s more!
- Yes, you can have references to references, so long as they eventually terminate.
- This (and the entropy coding) operates on bytes, not just ASCII characters.
So once the stream has had its redundancy removed, DEFLATE breaks it into chunks, and does one of three things with each chunk; each chunk becomes an output block. Either the chunk is Huffman coded (in one of two ways, below), or it is not. (Why not? Some data is not compressible, so Huffman coding would actually expand it. Instead DEFLATE just embeds it as a literal.)
If a chunk is going to be encoded, there are two methods, both Huffman coding, that could be used – fixed or dynamic Huffman. For our purposes it doesn’t matter which is used. Another thing you need to know is that there are actually two Huffman trees: one for byte literals and run lengths, the other for distances.
If a file is truncated, you might not know where to find the Huffman code (for at least the first such block). That wouldn’t be terrible if you could find the start of the next block, but that’s also tricky. Why? Because the block header is only three bits long. The first bit is “is this the last block”; the next two are:
- 00: a stored/raw/literal section, between 0 and 65,535 bytes in length.
- 01: a static Huffman compressed block, using a pre-agreed Huffman tree.
- 10: a compressed block complete with the Huffman table supplied.
- 11: reserved, don’t use.
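A sketch of reading a candidate header at an arbitrary bit position (an illustrative helper, not code from the paper; DEFLATE packs bits least-significant-bit first within each byte):

```python
def read_block_header(data: bytes, bit_pos: int):
    """Read a 3-bit DEFLATE block header at an arbitrary bit offset.

    Returns (is_last_block, block_type) where block_type is
    0 = stored, 1 = static Huffman, 2 = dynamic Huffman, 3 = reserved.
    """
    def bit(p):
        # DEFLATE bit order: LSB of each byte comes first in the stream.
        return (data[p // 8] >> (p % 8)) & 1

    bfinal = bit(bit_pos)
    btype = bit(bit_pos + 1) | (bit(bit_pos + 2) << 1)
    return bfinal, btype

# 0x05 = 0b00000101: BFINAL=1, BTYPE=2 (last block, dynamic Huffman)
print(read_block_header(bytes([0x05]), 0))  # (1, 2)
```

Since blocks are not byte aligned (see below), a scanner has to try every bit offset, which is exactly why the search space is so large.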
If a chunk is not going to be Huffman coded (that is, it’s a literal block), then after the three-bit header, the stream is padded to the next byte boundary; the next two bytes are a length field (and the two after that are a checksum – actually the one’s complement of the length), followed by up to 2^16 − 1 bytes of literal data (depending upon the length field).
Blocks are written sequentially, and are not byte aligned. That is, if a block ends in the middle of a byte, the next one starts in the middle of that byte.
So searching for all bit substrings of the form ‘000’ ‘001’ ‘010’ (and/or 100 101 110 for near the end of the file) is, uhh, not super great!
We can use the facts that (a) most of the time data ends up being Huffman coded, and (b) there is an explicit “end of block” symbol in the Huffman code at the end of each block to refine our search for block boundaries, or (c) that the length field in uncompressed blocks will aid in our search. And once we find one, the rest should just decode – so we need to find the earliest such boundary we can.
Finding the coded blocks
First, we find the last block; this cuts the search space in half (since we are fixing the first bit of the 3-bit header we’re looking for).
For the uncompressed case (which to be clear, is unusual) we can use the length and checksum to verify that it’s actually a valid block.
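A hypothetical validator for that check (a sketch: since the stored block’s “checksum” is just the one’s complement of the 16-bit length, verifying a candidate is cheap):

```python
def looks_like_stored_block(data: bytes, off: int) -> bool:
    """Check a candidate stored-block header at byte offset `off`.

    A stored block carries LEN (2 bytes, little-endian) followed by
    NLEN, its one's complement, so the test is LEN == ~NLEN & 0xFFFF.
    We also require that LEN bytes of data actually fit in what we have.
    """
    if off + 4 > len(data):
        return False
    length = int.from_bytes(data[off:off + 2], "little")
    nlen = int.from_bytes(data[off + 2:off + 4], "little")
    return length == (~nlen & 0xFFFF) and off + 4 + length <= len(data)

# LEN = 5, NLEN = ~5, followed by 5 bytes of payload: valid.
good = (5).to_bytes(2, "little") + (0xFFFA).to_bytes(2, "little") + b"hello"
print(looks_like_stored_block(good, 0))  # True
```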
For the compressed case, we need “only” verify that the stream ends with a valid end-of-block symbol. But this means we need the valid Huffman tree corresponding to this block. So here’s where some validation comes in. “nearly all candidate positions pass the checks for valid tree sizes,” in other words, at first glance, the encoded trees seem valid. But!
“over half fail in decoding the Huffman tree with which the bit-length values representing the actual Huffman compression trees are encoded,”
“and most of the remainder fail because the sequence of bit lengths computed using the first Huffman tree is invalid.”
In short, the Huffman trees are sent in their own encoded (“canonical”) format that has implicit constraints; if those constraints do not hold, this is easily detected, and thus you can reject this position as being an actual start-of-block.
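One such implicit constraint, sketched (my illustration, not the paper’s exact check): the code lengths that describe a canonical Huffman tree must form a complete prefix code, i.e. their Kraft sum must be exactly one. (Real DEFLATE decoders also tolerate one degenerate case for the distance tree; ignored here.)

```python
def valid_code_lengths(lengths) -> bool:
    """Can these Huffman code lengths form a complete canonical
    prefix code? Checks the Kraft sum: sum of 2^-len over nonzero
    lengths must equal exactly 1.
    """
    total = 0  # Kraft sum scaled by 2^15 (DEFLATE lengths are <= 15)
    for l in lengths:
        if l:
            total += 1 << (15 - l)
    return total == 1 << 15

print(valid_code_lengths([1, 2, 2]))  # True: complete code
print(valid_code_lengths([1, 1, 1]))  # False: oversubscribed
print(valid_code_lengths([2, 2, 2]))  # False: incomplete
```

A random bit position that happens to parse as a dynamic-Huffman header will usually fail a check like this, which is what makes the rejection rates quoted below possible.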
Then, “Fewer than 1 in 1000 possible bit positions yield valid Huffman trees; of those, 95% fail the end-of-data symbol check. Among the candidates which have a valid end-of-data symbol, over 80% are in fact valid, intact packets”
So once you find potentially valid start-of-blocks, you can check for valid Huffman trees (1000x reduction in search space). Then you check that the block ends with the end-of-block symbol. If it doesn’t, you’re not in the right place.
Once you do this for the last block, you repeat, scanning backwards into the file looking for the next block. And you can decompress them, of course – each independently.
Reversing redundancy removal
So once you Huffman decode, you’re left with the redundancy-removed input stream. We need to reverse the redundancy removal. So you can go ahead and do this, for the most part. Except! Sometimes the redundancy removal will point into parts of the data you don’t have access to! Ruh-ruh rooby roo! You do know the lengths of the missing data, but you do not know its contents.
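A sketch of what that replay looks like with missing history (the token format here is hypothetical, for illustration – not the paper’s internal representation):

```python
UNK = -1  # sentinel for a byte whose value we don't know

def decode_with_missing_history(ops, missing_prefix: int):
    """Replay a redundancy-removed (LZ77) stream when the first
    `missing_prefix` bytes of history are unavailable.

    `ops` is a simplified token stream: ('lit', byte) or
    ('match', distance, length). Matches reaching into the missing
    region copy the UNK sentinel: we learn the *length* of the
    missing data but not its contents.
    """
    out = [UNK] * missing_prefix
    for op in ops:
        if op[0] == 'lit':
            out.append(op[1])
        else:
            _, dist, length = op
            start = len(out) - dist
            for i in range(length):
                out.append(out[start + i])
    return out[missing_prefix:]  # the part covered by our fragment

# One known literal, then a match that reaches into lost history:
ops = [('lit', ord('h')), ('match', 3, 2)]
print(decode_with_missing_history(ops, missing_prefix=2))  # [104, -1, -1]
```

The unknowns scattered through the output are exactly what the language-model machinery below tries to fill in.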
Now is the point where the paper requires making some assumptions. In particular, it requires that you be able to meaningfully model the input stream, and to constrain its set of possible values. So here’s where the fine print is: the method works only if you can do this. In particular, the paper assumes text data (though they do test both English and Spanish), and a host of implicit related assumptions (that it’s all text; that you can build a reasonable language model; etc.).
If you grant all of that, though (and probably, you can, if you are doing this you know something about the input format), then you can attempt to reconstruct the missing data as follows.
“we use two language models, constraint propagation, and a greedy replacement strategy”
“The two language models are a byte-trigram model (storing joint probabilities) and a word unigram model (a word list with frequencies).”
The idea of constraint propagation in this context is that you maintain a set of all possible values that can be in each unknown place, and then you apply various filters to make the sets smaller and smaller.
First, “the trigram model is used to eliminate values which are not possible because the corresponding trigrams are never seen in training text (e.g. “sZt” would not appear in English text, thus “Z” is not a possible value for an unknown byte occurring between “s” and “t”, while a blank would be quite likely and would not be eliminated from consideration).”
So “impossible” sequences result in the elimination of certain values in each possible position of unknown. And this happens for each group of (overlapping) three bytes. What about unusual data?
“Because the file being reconstructed could contain unusual sequences, all trigrams of literal bytes in the Lempel-Ziv stream are counted and added to the overall trigram counts (a form of self-training).”
So in addition to a trigram model built from (training) text, they train on any available literals in the stream. Neat idea.
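A simplified sketch of the trigram elimination step (simplified in that it uses set membership of attested trigrams rather than the paper’s joint probabilities; the toy training data is mine):

```python
def filter_by_trigrams(seen_trigrams, left, right, candidates):
    """Constraint propagation with a trigram model: keep only
    candidate values v for an unknown byte such that the trigram
    left + v + right was attested in the training data.
    """
    return {v for v in candidates if left + v + right in seen_trigrams}

# Toy stand-in for the pre-trained + self-trained trigram model.
trigrams = {"s t", "sat", "set"}  # "sZt" is absent, so 'Z' gets eliminated
survivors = filter_by_trigrams(trigrams, "s", "t", {"Z", " ", "a", "e", "q"})
print(sorted(survivors))  # [' ', 'a', 'e']
```

Each elimination shrinks a neighboring unknown’s candidate set too, since trigrams overlap – that is the “propagation.”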
In practice, they filter in three passes using a probabilistic constraint propagation on the trigrams. This was probably built heuristically.
Then comes a pass on words in two lists. “The first list consists of words containing only literal bytes; these will be used as part of the lookup process to determine possible words, analogously to how trigrams from the file were added to the pre-trained trigram model.”
“The second list consists of words containing one or more unknowns, and it is sorted to order the words by the likelihood that they can accurately be reconstructed and the usefulness of that reconstruction in finding additional replacements: fewest unknowns first; if the same number of unknowns, less ambiguous unknowns first; then longer words first (they provide more constraints); finally, most frequent words first.”
Then they again do constraint propagation on all words with at least one unknown.
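The quoted ordering is just a four-part sort key. A sketch (the record format `(word, n_unknowns, ambiguity, frequency)` is hypothetical, chosen to make the key readable):

```python
def word_priority(word_info):
    """Sort key for the second word list: fewest unknowns first;
    then less ambiguous unknowns (smaller total candidate count);
    then longer words (more constraints); then more frequent words.
    """
    word, n_unknowns, ambiguity, freq = word_info
    return (n_unknowns, ambiguity, -len(word), -freq)

words = [("th?", 1, 5, 900), ("?e??o", 3, 40, 50), ("bec?use", 1, 5, 120)]
words.sort(key=word_priority)
print([w[0] for w in words])  # ['bec?use', 'th?', '?e??o']
```

"bec?use" sorts ahead of "th?" despite being rarer: same number of unknowns and ambiguity, but the longer word constrains more bytes.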
“It is important to keep in mind that the reconstruction process described above is only suited for files which may be segmented into word-like units; in practice, this means files consisting primarily of text. The recovery of undamaged DEFLATE packets is applicable to any type of file, but the presence of scattered unknown bytes is likely to make reconstruction of non-textual files impractical by any method unless their format is extremely tolerant to corruption.”
- In court, you’d probably need to present the recovered text (that is, with blanks in place). Then you might also be able to add the other text (recovered), but it’s much less likely to stand up. But as a recovery technique this is still useful.
- Not necessarily limited to text, but you need underlying data to have a usefully modelable structure.
- Modern NLP might be able to give you better models (and thus results) though this simple model was pretty good.