Identification and Recovery of JPEG Files with Missing Fragments
So you want to recover JPEGs. No problem, if they’re contiguous. But what if they’re not? Today’s paper formulates two possible problems and proposes ways to solve them.
First, what if there’s a gap (or a small number of gaps), and the method of Garfinkel we saw last class isn’t good enough for some reason? Could you be smarter about choosing which blocks to test as part of the file, rather than the exhaustive search Garfinkel proposes? Spoiler: yes.
Second, what if an essential part of the JPEG is missing? Like, the header? Is it possible to recover any of the remaining image, or are we out of luck? Again, spoiler time: Yes (well, sometimes).
In order to really understand this paper, though, you need to understand the JPEG format as it commonly exists on disk. And unfortunately, the half-page text-only description in Section 3 is really only useful if you already know how JPEG works and you need a quick refresher. So we’re going to start today by diving into the JPEG format and getting fairly deep into the nitty-gritty of how it works.
On to JPEG
So, let’s talk a little about JPEG. In particular, let’s talk about the encoding of image data in JPEG (365 talks a bit about the headers and the EXIF tag, but now we’re concerned with the nitty-gritty of image data). Why? Because it’s a good exemplar of a complicated format, and knowing some of its details will help us when we attempt to recover files / data in that format.
At a high level, how does JPEG encoding work?
First, the data is transformed from the more traditional RGB color space into the YCbCr color space. Y is luma (brightness); Cb and Cr are chrominance (the blue-difference and red-difference components).
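As a concrete sketch, here’s the per-pixel conversion using the standard JFIF formulas (real encoders do essentially this, often with fixed-point arithmetic):

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel (components 0-255) to YCbCr, per the JFIF spec."""
    y  =  0.299   * r + 0.587   * g + 0.114   * b
    cb = -0.1687  * r - 0.3313  * g + 0.5     * b + 128
    cr =  0.5     * r - 0.4187  * g - 0.0813  * b + 128
    return y, cb, cr
```

Note the conversion is (nearly) lossless mathematically; the loss comes later, from subsampling and quantization.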
The Cb and Cr are often downsampled (sometimes called subsampling), since they’re less important (but you need to know by how much if you want to reverse this later).
Next the image is split into blocks, usually 8x8, but again, you need to know the size, and each block (really, each triplet of blocks: Y, Cb, Cr) is handled separately from here on out.
Then there’s something called the DCT (discrete cosine transform) – think of it as a reversible mathematical transformation. The result is “quantized” – a many-to-few mapping that truncates some of the data (losing the stuff that’s less visible to the human eye first) using a defined table of values. This table is typically used image-wide, and is needed to reverse (lossily) the quantization! The result is reordered in a particular way that will typically place runs of zeros together, then run-length encoded. Finally, the run-length-encoded values are Huffman-coded.
Let’s break that down.
(slides from class)
Easy things first: you can convert from one “colorspace” to another mathematically.
Turns out the human visual system cares more about brightness than color, so it’s useful to split it out. Look at the example to see this. (rods/cones; night vision is colorless, etc.)
And if we put each of these on a greyscale (just for illustration) you can see even more clearly that your eye “sees more” in the Y (brightness) component than the others.
Since the Cb and Cr channels are less informative to the human visual system, we don’t need to keep all of the information. Usually, JPEGs subsample them by a factor of two in each dimension. This is the first “lossy” thing JPEG does. We can reverse the transformation by scaling the shrunken Cb and Cr back up, but we do lose some of the data irreversibly here. The idea is that we’re losing the stuff that’s less “important” to the eye, so it’s not (very) noticeable.
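A minimal sketch of that factor-of-two-per-dimension subsampling, done by averaging 2×2 blocks (encoders differ in the details – some just drop pixels rather than average – and this toy version assumes even dimensions, where a real encoder would pad):

```python
def subsample_420(channel):
    """Downsample a chroma channel by 2 in each dimension, averaging 2x2 blocks.
    `channel` is a list of rows of pixel values; height and width assumed even."""
    h, w = len(channel), len(channel[0])
    return [[(channel[y][x] + channel[y][x + 1] +
              channel[y + 1][x] + channel[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]
```

Reversing it just scales each output pixel back up to a 2×2 block – the within-block variation is gone for good.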
The image (really, each of the three images: Y, Cb, Cr) are broken up into 8x8 blocks of pixels. Each block is then handled in sequence, left to right, top to bottom – one by one they are encoded in a special way and written to a file sequentially. Every set number of blocks, there is also written a “restart marker” – a particular byte sequence – that tells the JPEG decoder that the next byte is the start of a new group of blocks. This was originally for lossy transmission channels, but the authors of this paper leverage it to recover the second-half of (headerless) JPEGs.
These blocks (really, groups of them, one or more per color component depending on subsampling) are referred to as MCUs – Minimum Coded Units.
Now’s when it starts to get weird, unless you have a background in signals processing. We won’t go too deep here, but we will go a little ways.
So, let’s look at a block. It’s just eight by eight (64) pixels, with varying values. It’s hard to say where the “important” bits are.
So we’re going to “transform” it into another domain – frequency rather than space. What?!?!?
OK, so let’s do an example in a linear space first. (on board)
There are some number of “basis patterns” (which really are discretized cosine functions). And any possible pattern is representable as a linear combination of these patterns. Why? And how? We’re not going that deep in this class – again, signals processing. This is probably familiar to you if you’ve seen FFTs, for example.
OK, so if we can do it in one dimension, we can do it in two, too.
First, we shift the range from 0..255 to -128..127. Why? Not because it’s more compressible (same number of bits), but because the DCT behaves better on values centered around zero (it keeps the first coefficient from dominating). Note that this shift is losslessly reversible.
Now, let’s consider what this looks like in two dimensions. We still have an 8x8 matrix, and we want to transform it into the frequency domain using the DCT. The basis functions can be represented visually as sorta checkerboards. See https://en.wikipedia.org/wiki/Discrete_cosine_transform#Example_of_IDCT
So we do the same sort of thing – what sum of DCTs, weighted, results in the input pattern? The weights are the output here (the basis is well-known).
Notably, the DCT is invertible losslessly – nothing goes missing here (within the limits of floating point).
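To pin the math down, here’s a naive 2D DCT-II (the transform JPEG uses) with orthonormal scaling. It’s O(N^4) – real codecs use fast factorizations – but the output is the same:

```python
import math

def dct2(block):
    """Naive 8x8 2D DCT-II, orthonormal scaling. Input: 8x8 list of lists
    (pixel values, already level-shifted to be centered around zero)."""
    N = 8
    def alpha(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * x + 1) * v * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * u * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out
```

Feeding it a constant block gives a single nonzero value in the upper-left (the “DC” coefficient) and zeros everywhere else – exactly the “big swatches first” behavior described above.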
Something interesting happens as a result, though – most of the information (that is, the values with the largest magnitude) is concentrated in the upper-left corner of the matrix. That is, the coefficients corresponding to the big swatches are weighted most heavily, and the tighter patterns (the fine details) less so. That’s because the big details matter most in reconstructing the image, but also: your eye notices fine details less. Which leads us to the next step in JPEG that leverages this fact.
Suppose we want to keep only the most visually important information in each block. We can “quantize” our DCT matrix, and selectively throw away some of the detail. How? We specify a “quantization matrix” Q which is an 8x8 matrix of numbers. Then we “quantize” our DCT matrix by dividing each value in the DCT by the corresponding value in the quantization matrix, and rounding. So, small values in the quantization matrix correspond to “important” values in the DCT; larger values correspond to less important values, as they’ll get pushed closer to zero due to the larger divisor.
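Sketched in code (the all-16s `q_table` below is a stand-in for illustration, not a real JPEG table):

```python
def quantize(dct_block, q_table):
    """Divide each DCT coefficient by the matching quantization-table entry
    and round. This is the deliberately lossy step."""
    return [[round(dct_block[u][v] / q_table[u][v]) for v in range(8)]
            for u in range(8)]

def dequantize(quant_block, q_table):
    """Reverse the quantization by multiplying back -- but the rounding error
    is gone for good."""
    return [[quant_block[u][v] * q_table[u][v] for v in range(8)]
            for u in range(8)]
```

Round-tripping a coefficient of 100 through a divisor of 16 gives you back 96, not 100 – that’s the detail being thrown away.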
Different encoders specify different Q; it’s also generally what you’re picking when you set a quality level in your JPEG.
You can reverse this transformation (though not losslessly – the quantization is deliberately throwing out some detail). But only if you have the quantization matrix, which is stored in the JPEG header!
Notice there are lots of zeros in the matrix. If you’ve ever learned about compression, you probably know that lots of repeats are easy to compress. So first, we reorder the values in the matrix using “zigzag” ordering, making it likely that the zeros in the lower-right end up contiguous in the output.
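One way to generate that ordering is to walk the anti-diagonals of the 8x8 matrix, alternating direction – a sketch:

```python
def zigzag_order():
    """Return the 64 (row, col) index pairs of an 8x8 block in JPEG zigzag order."""
    order = []
    for d in range(15):  # anti-diagonals 0..14; row + col == d on each
        coords = [(r, d - r) for r in range(8) if 0 <= d - r < 8]
        if d % 2 == 0:
            coords.reverse()  # even diagonals run bottom-left to top-right
        order.extend(coords)
    return order
```

Reading coefficients in this order walks from the heavily weighted upper-left corner down toward the (mostly zero) lower-right.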
Now that we have our sequence of values with long “runs” of the same value, we do runlength encoding. In essence, RLE has you write out (value, count) pairs; it will shrink inputs with long runs (at the expense of expanding inputs without any runs). It is losslessly reversible, though.
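A sketch of generic RLE (for the record, JPEG’s actual AC coefficient coding is a bit different – it counts only runs of zeros before each nonzero coefficient, plus an end-of-block code – but the idea is the same):

```python
def rle_encode(seq):
    """Generic run-length encoding: collapse the sequence into (value, count) pairs."""
    out = []
    for v in seq:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)  # extend the current run
        else:
            out.append((v, 1))             # start a new run
    return out

def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence -- lossless."""
    return [v for v, n in pairs for _ in range(n)]
```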
Finally, we Huffman code the values. Huffman codes are a whole lecture in an algorithms class, but I’ll summarize here.
In short, you want to code a set of input symbols efficiently. The input symbols have a distribution, so you choose short outputs for common inputs, and longer outputs for uncommon inputs. Like, in English, you might give the letter ‘e’ the output symbol ‘1’ (a single bit). The other interesting thing about Huffman codes is that no (output) symbol is a prefix of any other, so you can read these outputs and unambiguously turn them back into the input, if you have the table available.
How do you build Huffman codes? There’s a bottom-up treebuilding algorithm to do this, but again, once you have the table, you’re good to go. Huffman coding (and decoding) is lossless. (Show example from Wikipedia.)
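Here’s a sketch of that bottom-up build using a heap – repeatedly merge the two least-frequent subtrees, prepending a bit to each side’s codes:

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code from {symbol: frequency} via the classic
    bottom-up algorithm. Returns {symbol: bitstring}."""
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]
```

Common symbols end up near the root (short codes), rare ones deep in the tree (long codes), and by construction no code is a prefix of another.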
Now, some JPEG encoders use “standard” Huffman tables; others are per-device; others might do it per-image. But you absolutely need the table to invert the code!
Other minor details
There are some other things going on here that we’re not going to go too far into. One that does matter though, is that the value in the upper-left of the DCT is actually stored as the difference from the previous matrix’s upper-left value.
See, ‘cuz it’s smaller, and thus needs fewer bits. This means if you don’t have access to the previous DCT matrix, you don’t know the most important thing in this one!
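In code, the DC-difference idea looks like this (a sketch – real JPEG then entropy-codes these differences):

```python
def dc_deltas(dc_values):
    """Encode each block's DC coefficient as the difference from the previous
    block's. The first block is coded against a predictor of 0."""
    prev = 0
    out = []
    for dc in dc_values:
        out.append(dc - prev)
        prev = dc
    return out
```

Decoding requires the running sum, so losing even one block corrupts every DC value after it – until something resets the predictor.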
(There are a couple other minor details like this we’re going to skip over, as they’re not as relevant to the recovery process.)
Now that we know how JPEG works at a low level, how can we leverage this information to understand how to recover fragmented JPEGs via carving? That’s the paper.
Non-optimal Huffman codes mean some bit sequences are invalid, so you can test whether a candidate block plausibly decodes – blocks that do are likely to be Huffman-coded with those codes.
The steps you need header information for?
- width and height -> to determine the 8x8 blocks
- Quantization tables
- amount of Cb/Cr subsampling (though you can guess this one; there aren’t many options)
- Huffman codes
But, what about the upper-left DCT-value depending upon the previous block’s value?
JPEG uses “restart codes” occasionally in image data, which signal that the next DCT block will be byte-aligned and that its DC value will be coded as an offset from 0 (that is, it won’t depend upon the previous block). Why? JPEG predates reliable network transmission :) But handy for us in this case.
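A sketch of scanning raw data for those restart markers, which are the two-byte sequences 0xFF 0xD0 through 0xFF 0xD7 (a real carver does more validation than this, but the markers are where decoding can re-synchronize):

```python
def find_restart_markers(data):
    """Return byte offsets of JPEG restart markers (0xFF 0xD0 .. 0xFF 0xD7)
    in raw scan data. Decoding can re-start right after each one."""
    offsets = []
    i = 0
    while i < len(data) - 1:
        if data[i] == 0xFF and 0xD0 <= data[i + 1] <= 0xD7:
            offsets.append(i)
            i += 2
        else:
            i += 1
    return offsets
```

(Within entropy-coded data, a literal 0xFF byte is always followed by a stuffed 0x00, so an 0xFF followed by 0xD0–0xD7 really is a marker.)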
So, if you can find other JPEGs on the same media with the same Huffman codes and/or quantization tables, you can decode from the first restart tag you find. Prepend a valid header to the recovered blocks and see what happens; you might need to do a search on Cr/Cb subsampling (and possibly on valid width/heights), but if the JPEGs share other details, it can work out.
What about progressive JPEGs?
Some JPEGs are stored in “progressive” mode. Progressively encoded JPEGs let you see a blurry version first, then a progressively sharper version as the image is decoded. They’re often slightly smaller (filesize-wise), but require more resources to encode and decode. How do they work?
Instead of writing each block sequentially, we pull out the most important parts of the blocks, and write them first. So, the upper-left value after the DCT – it tells us the most about the block; these all get written first. Then one (or more) of the next few values, etc. This is called “spectral selection.”
You can go even further with this idea, writing just the most significant bits of the coefficients first, and the least significant bits later (called “successive approximation”).
The encoder / decoder both must do many passes through the image, then. These are called “scans” and progressive JPEGs typically have about 10 scans.
Battery-powered and smaller devices do not typically encode progressive JPEGs, because of the memory and CPU costs. Likewise, they’re not often used on the web for the same reason (since so many clients are phones these days).
What are the implications for this recovery technique where a header is missing? Bad, right? Now the “header” also includes the first few hundred bytes of the image data, which are the most important in terms of detail!
On the other hand, this means that if you get the front half of a truncated progressive JPEG, then you can still see the whole image (albeit degraded).