04: JPEG recovery
Identification and Recovery of JPEG Files with Missing Fragments
So you want to recover JPEGs. No problem, if they’re contiguous. But what if they’re not? Today’s paper formulates two possible problems and proposes ways to solve them.
First, what if there’s a gap (or a small number of gaps), and the method of Garfinkel we saw last class isn’t good enough for some reason? Could you be smarter about choosing which blocks to test as part of the file, rather than the exhaustive search Garfinkel proposes? Spoiler: yes.
Second, what if an essential part of the JPEG is missing? Like, the header? Is it possible to recover any of the remaining image, or are we out of luck? Again, spoiler time: Yes (well, maybe).
In order to really understand this paper, though, you need to understand the JPEG format as it commonly exists on disk. And unfortunately, the half-page text-only description in Section 3 is really only useful if you already know how JPEG works and you need a quick refresher. So we’re going to start today by diving into the JPEG format and getting fairly deep into the nitty-gritty of how it works.
On to JPEG
So, let’s talk a little about JPEG. In particular, let’s talk about the encoding of image data in JPEG (365 talks a bit about the headers and the EXIF tag, but now we’re concerned with the nitty-gritty of image data). Why? Because it’s a good exemplar of a complicated format, and knowing some of its details will help us when we attempt to recover files or data in that format.
At a high level, how does JPEG encoding work?
First, the data is transformed into the YCbCr color space from the more traditional RGB color space. Y is brightness; Cb and Cr are chrominance (the blue-difference and red-difference channels, respectively).
The Cb and Cr are often downsampled (sometimes called subsampling), since they’re less important (but you need to know by how much if you want to reverse this later).
Next the image is split into blocks, usually 8x8, but again, you need to know the size, and each block (really, each triple of blocks: Y, Cb, Cr) is handled separately from here on out.
Then there’s something called a DCT; think of it as a reversible mathematical transformation. The result is “quantized”, which is a many-to-few mapping: it truncates some of the data (losing the stuff that’s less visible to the human eye first) using a defined table of values. This table is typically used image-wide, and is needed to reverse (lossily) the quantization! The quantized result is reordered in a particular way that will typically place runs of zeros together. It’s RLE-ed. Finally, the run-length-encoded values are Huffman-coded.
Let’s break that down.
Easy things first: you can convert from one “colorspace” to another mathematically.
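Here's a minimal sketch of that conversion, assuming the full-range (0..255) formulas that JFIF files typically use. The function names are just for illustration.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one full-range RGB pixel (0..255) to YCbCr (JFIF-style)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Inverse conversion; exact up to rounding of the constants."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return r, g, b


# Any grey pixel has "no color": Cb and Cr both sit at the midpoint, 128.
print(rgb_to_ycbcr(255, 255, 255))
print(rgb_to_ycbcr(255, 0, 0))
```

Note that nothing is lost here (beyond rounding if you store the results as integers); it's just a change of coordinates.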
Turns out the human visual system cares more about brightness than color, so it’s useful to split it out. Look at the example to see this. (rods/cones; night vision is colorless, etc.)
And if we put each of these on a greyscale (just for illustration), you can see even more clearly that your eye “sees more” in Y.
Since the Cb and Cr channels are less informative to the human visual system, we don’t need to keep all of the information. Usually, JPEGs subsample them by a factor of two in each dimension. This is the first “lossy” thing JPEG does. We can reverse the transformation by scaling the shrunken Cb and Cr back up, but we do lose some of the data irreversibly here. The idea is that we’re losing the stuff that’s less “important” to the eye, so it’s not (very) noticeable.
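A rough sketch of 2x subsampling in each dimension, and of scaling back up. This assumes the encoder averages each 2x2 group and the decoder just replicates pixels; real encoders and decoders may filter differently, and the function names are made up for this example. It also assumes even dimensions.

```python
def subsample_2x2(channel):
    """Shrink a channel by 2 in each dimension by averaging 2x2 groups."""
    h, w = len(channel), len(channel[0])
    return [
        [
            (channel[y][x] + channel[y][x + 1]
             + channel[y + 1][x] + channel[y + 1][x + 1]) / 4
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


def upsample_2x2(channel):
    """Scale back up by repeating each value in a 2x2 group."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


cb = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [50, 60, 70, 80],
]
small = subsample_2x2(cb)
print(small)                # a quarter as many values to store
print(upsample_2x2(small))  # same size as cb, but the detail is gone
```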
The image (really, each of the three images: Y, Cb, Cr) is broken up into 8x8 blocks of pixels. Each block is then handled in sequence, left to right, top to bottom: one by one they are encoded in a special way and written to a file sequentially.
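As a sketch, splitting one channel into blocks in that order might look like the following. It assumes the width and height are multiples of the block size (real encoders pad the edges), and it ignores how the Y, Cb, and Cr blocks are interleaved in the file. The function name is hypothetical.

```python
def split_blocks(channel, n=8):
    """Return the n x n blocks of a channel, left to right, top to bottom."""
    h, w = len(channel), len(channel[0])
    blocks = []
    for top in range(0, h, n):
        for left in range(0, w, n):
            blocks.append([row[left:left + n] for row in channel[top:top + n]])
    return blocks


# A 16x16 channel where each pixel's value encodes its position.
channel = [[y * 16 + x for x in range(16)] for y in range(16)]
blocks = split_blocks(channel)
print(len(blocks))      # number of 8x8 blocks
print(blocks[1][0][:3])  # start of the second block (top row, to the right)
```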
Now’s when it starts to get weird, unless you have a background in signals processing. We won’t go too deep here, but we will go a little ways.
So, let’s look at a block. It’s just eight by eight (64) pixels, with varying values. It’s hard to say where the “important” bits are.
So we’re going to “transform” it into another domain – frequency rather than space. What?!?!?
OK, so let’s do an example in a linear space first. (on board)
There are some number of “basis patterns” (which really are discretized cosine functions), and any possible pattern is representable as a linear combination of these patterns. Why? And how? We’re not going that deep in this class; again, signals processing. This is probably familiar to you if you’ve seen FFTs, for example.
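If you want to poke at the one-dimensional case yourself, here's a small sketch. It assumes the orthonormal DCT-II (the usual "the DCT") and its inverse, written the slow, obvious way; the function names are just for this example.

```python
import math


def _scale(k, n):
    # Normalization that makes the transform orthonormal.
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)


def dct_1d(x):
    """Weights of the cosine basis patterns that sum to the signal x."""
    n = len(x)
    return [
        _scale(k, n)
        * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def idct_1d(c):
    """Rebuild the signal as a weighted sum of the basis patterns."""
    n = len(c)
    return [
        sum(_scale(k, n) * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n))
        for i in range(n)
    ]


flat = [5] * 8
print([round(w, 3) for w in dct_1d(flat)])  # only the first weight is nonzero
ramp = [0, 1, 2, 3, 4, 5, 6, 7]
print([round(v, 6) for v in idct_1d(dct_1d(ramp))])  # round-trips to the ramp
```

A flat signal needs only the "constant" basis pattern; anything with more wiggle needs the higher-frequency ones too.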
OK, so if we can do it in one dimension, we can do it in two, too.
First, we adjust the range from 0..255 to -128..127. Why? Not because it’s more compressible (it’s the same number of bits), but because the DCT works best when the values are centered on zero, which keeps the magnitudes it has to deal with (the dynamic range) smaller. Note that this shift is losslessly reversible.
Now, let’s consider what this looks like in two dimensions. We still have an 8x8 matrix, and we want to transform it into the frequency domain using the DCT. The basis functions can be represented visually as sorta checkerboards.
So we do the same sort of thing: what weighted sum of the DCT basis patterns results in the input pattern? The weights are the output here (the basis is well-known).
Notably, the DCT is losslessly invertible: nothing goes missing here (within the limits of floating point).
Something interesting happens as a result, though: most of the information (that is, the values with the largest magnitude) is concentrated in the upper-left corner of the matrix. That is, the coefficients corresponding to the big swatches are weighted most heavily, and the tighter patterns (the fine details) less so. That’s because the big details matter most in reconstructing the image, but also: your eye notices fine details less. Which leads us to the next step in JPEG, which leverages this fact.
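Here's a sketch of the two-dimensional version, including the shift to -128..127. It assumes the orthonormal 8x8 DCT written as matrix products (C times block times C-transpose); real codecs use much faster integer approximations. All names are illustrative.

```python
import math

N = 8
# Row k of C is the k-th one-dimensional cosine basis pattern.
C = [
    [
        (math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
        * math.cos(math.pi * (2 * i + 1) * k / (2 * N))
        for i in range(N)
    ]
    for k in range(N)
]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct_2d(block):
    """Level-shift an 8x8 block of 0..255 values, then apply the 2D DCT."""
    shifted = [[v - 128 for v in row] for row in block]
    return matmul(matmul(C, shifted), transpose(C))


def idct_2d(coeffs):
    """Invert the 2D DCT and undo the level shift."""
    shifted = matmul(matmul(transpose(C), coeffs), C)
    return [[v + 128 for v in row] for row in shifted]


flat = [[200] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 3))  # everything lands in the upper-left weight
print(max(abs(coeffs[u][v]) for u in range(8) for v in range(8) if (u, v) != (0, 0)))
```

Running idct_2d on the output of dct_2d gives back the original block, up to floating-point error.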
Suppose we want to keep only the most visually important information in each block. We can “quantize” our DCT matrix, and selectively throw away some of the detail. How? We specify a “quantization matrix” Q which is an 8x8 matrix of numbers. Then we “quantize” our DCT matrix by dividing each value in the DCT by the corresponding value in the quantization matrix, and rounding. So, small values in the quantization matrix correspond to “important” values in the DCT; larger values correspond to less important values, as they’ll get pushed closer to zero due to the larger divisor.
Different encoders specify different Q; it’s also generally what you’re picking when you set a quality level in your JPEG.
You can reverse this transformation (though not losslessly – the quantization is deliberately throwing out some detail). But only if you have the quantization matrix, which is stored in the JPEG header!
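A small sketch of quantizing and dequantizing. The table Q below is a made-up toy (small divisors in the upper-left, large ones in the lower-right), not a table from any real encoder, and the coefficient values are invented for illustration.

```python
# Toy quantization table (an assumption for this example, not a real one).
Q = [[8 + 4 * (u + v) for v in range(8)] for u in range(8)]


def quantize(coeffs, q):
    """Divide each DCT coefficient by its table entry and round."""
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return [[round(c / d) for c, d in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, q)]


def dequantize(levels, q):
    """Multiply back; whatever the rounding discarded stays lost."""
    return [[lv * d for lv, d in zip(lrow, qrow)]
            for lrow, qrow in zip(levels, q)]


coeffs = [[0.0] * 8 for _ in range(8)]
coeffs[0][0] = 576.0   # big upper-left value
coeffs[0][1] = -31.0   # some coarse detail
coeffs[7][7] = 10.0    # a little fine detail

levels = quantize(coeffs, Q)
restored = dequantize(levels, Q)
print(levels[0][0], levels[0][1], levels[7][7])
print(restored[0][0], restored[0][1], restored[7][7])
```

The fine detail at the lower-right quantizes to zero and never comes back, and without Q you can't even do the multiply.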
Notice there are lots of zeros in the quantized matrix. If you’ve ever learned about compression, you probably know that lots of repeats are easy to compress. So first, we reorder the values in the matrix using “zigzag” ordering, which makes it more likely that the zeros in the lower-right are all contiguous in the output.
Now that we have our sequence of values with long “runs” of the same value, we do run-length encoding. In essence, RLE has you write out (value, count) pairs; it will shrink inputs with long runs (at the expense of expanding inputs without any runs). It is losslessly reversible, though.
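One way to generate the zigzag order is to walk the anti-diagonals of the matrix, alternating direction. This is a sketch with hypothetical function names; real codecs usually just hard-code the 64-entry table.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n matrix in zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col picks one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)  # even diagonals run bottom-left to top-right
        order.extend((r, s - r) for r in rows)
    return order


def zigzag(block):
    """Flatten a square block into a list using zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


print(zigzag_order()[:6])

block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0] = 72, -3, 5
print(zigzag(block)[:5])  # the nonzero values come first, then the zeros
```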
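Here's a sketch of the generic (value, count) scheme just described. JPEG's actual run-length step is specialized to runs of zeros, so treat this as the idea rather than the on-disk format; the names are illustrative.

```python
def rle_encode(values):
    """Collapse runs of equal values into (value, count) pairs."""
    pairs = []
    for v in values:
        if pairs and pairs[-1][0] == v:
            pairs[-1][1] += 1
        else:
            pairs.append([v, 1])
    return [tuple(p) for p in pairs]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


zigzagged = [72, -3, 5] + [0] * 61
pairs = rle_encode(zigzagged)
print(pairs)                            # 64 values become 4 pairs
print(rle_decode(pairs) == zigzagged)   # lossless
print(rle_encode([1, 2, 3]))            # no runs: output is bigger than input
```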
Finally, we Huffman code the values. Huffman codes are a whole lecture in 311, but I’ll summarize.
In short, you want to code a set of input symbols efficiently. The input symbols have a distribution, so you choose short outputs for common inputs, and longer outputs for uncommon inputs. Like, in English, you might give the letter ‘e’ the output symbol ‘1’ (like, 1 bit, set to 1). The other interesting thing about Huffman codes is that no (output) symbol is a prefix of any other, so you can read these outputs and unambiguously turn them back into the input, if you have the table available.
How do you build Huffman codes? There’s a bottom-up tree-building system to do this, but again, once you have the table, you’re good to go. Huffman coding (and decoding) is lossless.
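For the curious, here's a compact sketch of that bottom-up construction: repeatedly merge the two least-frequent subtrees. This is the textbook algorithm on a toy message, not the way JPEG lays out its tables in the header, and the names are made up for the example.

```python
import heapq
from collections import Counter


def build_huffman(symbols):
    """Return a {symbol: bitstring} table built from symbol frequencies."""
    freq = Counter(symbols)
    # Heap entries: (count, tie-breaker, {symbol: code-so-far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t1.items()}
        merged.update({s: "1" + code for s, code in t2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, table):
    return "".join(table[s] for s in symbols)


def decode(bits, table):
    """Read bits until they match a code; prefix-freeness makes this safe."""
    inverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


message = "e" * 8 + "t" * 3 + "a" * 2 + "o"
table = build_huffman(message)
bits = encode(message, table)
print(table)
print(len(bits), "bits instead of", 8 * len(message))
print("".join(decode(bits, table)) == message)
```

Try decoding with the wrong table and you'll get garbage, which is the whole point for recovery.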
Now, some JPEG encoders use “standard” Huffman tables; others are per-device; others might do it per-image. But you absolutely need the table to invert the code!
Other minor details
There are some other things going on here that we’re not going to go too far into. For example, the value in the upper-left of the DCT matrix is actually stored as the difference from the previous matrix’s upper-left value. See, ‘cuz the difference is usually smaller, and thus needs fewer bits. This means if you don’t have access to the previous DCT matrix, you don’t know the most important thing in this one! There are a couple other minor details like this we’re going to skip over.
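A tiny sketch of that difference coding, and of what goes wrong when you start decoding in the middle. The values and function names are invented for illustration.

```python
def to_diffs(upper_left_values):
    """Store each block's upper-left value as a difference from the previous one."""
    diffs, prev = [], 0
    for value in upper_left_values:
        diffs.append(value - prev)
        prev = value
    return diffs


def from_diffs(diffs, start=0):
    """Rebuild the values; 'start' is the value of the block before the first diff."""
    values, prev = [], start
    for d in diffs:
        prev += d
        values.append(prev)
    return values


true_values = [72, 75, 74, 80]
diffs = to_diffs(true_values)
print(diffs)                   # small numbers after the first
print(from_diffs(diffs))       # full recovery from the start of the image
print(from_diffs(diffs[2:]))   # starting mid-stream: every value is off by 75
```

Starting mid-stream, the block-to-block changes are right but the absolute level is wrong by a constant you don't know.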
Now that we know how JPEG works at a low level, how can we leverage this information to understand how to recover fragmented JPEGs via carving? That’s the paper.
Non-optimal Huffman codes mean you can find blocks that are likely to be Huffman-coded with those codes.
The steps you need header information for?
- Huffman codes
- Quantization tables
- amount of Cb/Cr subsampling (though you can guess this one, not many options)
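To see where these live, here's a sketch that walks the marker segments at the front of a JPEG and reports the ones that carry the items above. It relies on the standard layout (each segment is 0xFF, a marker byte, then a two-byte big-endian length that includes itself). The function name is hypothetical, it only names a handful of markers, and it has no error handling for damaged files.

```python
import struct

SEGMENT_NAMES = {
    0xDB: "DQT (quantization tables)",
    0xC4: "DHT (Huffman tables)",
    0xC0: "SOF0 (baseline frame header: dimensions, sampling factors)",
    0xDD: "DRI (restart interval)",
    0xDA: "SOS (start of scan; image data follows)",
}


def list_header_segments(data):
    """Return (offset, marker byte, description) for each header segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("does not start with the SOI marker")
    pos, found = 2, []
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        found.append((pos, marker, SEGMENT_NAMES.get(marker, "other")))
        if marker == 0xDA:  # entropy-coded data starts after this segment
            break
        pos += 2 + length
    return found


# A fake, tiny "header" with dummy payloads, just to exercise the walker.
fake = (b"\xff\xd8"
        + b"\xff\xdb\x00\x04\xaa\xbb"
        + b"\xff\xc4\x00\x03\xcc"
        + b"\xff\xda\x00\x02")
for offset, marker, name in list_header_segments(fake):
    print(offset, hex(marker), name)
```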
But, what about the upper-left DCT problem?
Turns out JPEG occasionally uses “restart codes” (restart markers) in image data, which signal that the next DCT block will be byte-aligned and that its upper-left value will be reset to being an offset from 0 (that is, it won’t depend upon the previous block). Why? JPEG predates reliable network transmission :)
So, if you can find other JPEGs on the same media with the same Huffman codes and/or quantization tables, you can decode from the first restart code you find.
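Finding candidate restart codes in a raw fragment is just a byte scan. This sketch assumes the standard marker values (0xFF followed by 0xD0 through 0xD7, cycling in order), and that a literal 0xFF inside the coded image data is always followed by a stuffed 0x00, so it can't be mistaken for a marker. The function name is hypothetical.

```python
def find_restart_markers(data):
    """Return (offset, n) for each RSTn marker (0xFF 0xD0..0xD7) in data."""
    hits = []
    for i in range(len(data) - 1):
        if data[i] == 0xFF and 0xD0 <= data[i + 1] <= 0xD7:
            hits.append((i, data[i + 1] - 0xD0))
    return hits


fragment = bytes([
    0x12, 0xFF, 0x00, 0x34,   # stuffed 0xFF 0x00: image data, not a marker
    0xFF, 0xD0, 0x56,         # RST0
    0xFF, 0xD1,               # RST1
    0xFF, 0xD9,               # end-of-image marker, not a restart
])
print(find_restart_markers(fragment))
```

Decoding would begin at the byte right after a marker; a fragment from some other file type can of course contain these byte pairs by chance, so treat hits as candidates.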