03: carving
Carving
Today we’re covering “Carving Contiguous and Fragmented Files with Object Validation”, another paper by Simson Garfinkel. It’s a pretty easy read and thus good for the early part of the semester. Later papers might not be so clear for various reasons.
Anyway: the paper is about file carving. Carving, as you’ll recall, is finding arbitrary files embedded in other files, without the use of filesystem metadata. This comes up when filesystem metadata is deleted, damaged, or otherwise unavailable, or when you’re looking for files that might be embedded in other files.
At a very high level, the way carvers work is by linearly scanning a file (representing a disk image, or that may contain embedded files, etc.), looking for markers or delimiters that indicate the start of files of interest. The carver then uses some algorithm to decide how much of the file to select, starting from the marker and running until some condition is met. This process is repeated to extract all files of interest.
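To make this concrete, here's a minimal sketch (not from the paper) of a contiguous header/footer carver. The JPEG start/end markers are real; the function name, the "take everything up to the next footer" policy, and the max-length cutoff are illustrative assumptions.

```python
# Minimal sketch of a naive contiguous header/footer carver.
# JPEG start-of-image (FF D8 FF) and end-of-image (FF D9) markers are real;
# the "take everything up to the next footer" policy is an illustrative simplification.

JPEG_HEADER = b"\xff\xd8\xff"
JPEG_FOOTER = b"\xff\xd9"

def carve_contiguous(image: bytes, header: bytes, footer: bytes, max_len: int = 16 * 2**20):
    """Yield candidate byte ranges that start with `header` and end with `footer`."""
    pos = 0
    while True:
        start = image.find(header, pos)
        if start == -1:
            return
        end = image.find(footer, start, start + max_len)
        if end != -1:
            yield image[start:end + len(footer)]
        pos = start + 1  # keep scanning; headers may overlap or be false positives

# usage sketch:
# with open("disk.img", "rb") as f:
#     for i, candidate in enumerate(carve_contiguous(f.read(), JPEG_HEADER, JPEG_FOOTER)):
#         open(f"carved_{i}.jpg", "wb").write(candidate)
```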
There are two high-level complications to consider, which I’ll cover in brief now and return to later on.
Complications
The simple form of carving described above handles only contiguous files. In other words, only byte ranges that are fully contiguous within the image file can be extracted. This means non-contiguous files will be missed (or at best, truncated).
So that’s an error! Which is the other complication. Carvers make mistakes. They can fail to find (or fail to find all of) a file that actually exists – these are kinds of false negatives. They can also find files that don’t actually exist, that is, the delimiter that was found might be part of unrelated data, and not actually indicate a file of interest. These are false positives.
One insight of this paper: rapidly generate as many positives (false and true) as possible in a linear scan; then winnow the list down with various fast tests, applying increasingly expensive (in time/space) tests as winnowing continues.
Paper contributions
- Survey of FS fragmentation from disks in the wild. How often are non-contiguous files an issue in practice?
- Describes various carving techniques and uses them to define several carving algorithms.
- Applies these algorithms to the 2006 DFRWS Forensics Challenge.
Fragmentation in the wild
Garfinkel collected used HDs (ranging in size up to 20GB – remember this paper was published in 2007) by purchasing them on eBay over an 8-year period. About 1/3 were sanitized, but 2/3 still contained user data.
sanitized – reformatted, including an overwrite-with-pattern (typically 0s)
At the time, this corpus contained ~324 drives with useful data; TSK found 2.2M files with filenames, about 2.1M with associated data, 892 GB recoverable. The corpus is larger now: https://digitalcorpora.org contains many real and synthetic images from HDs, flash drives, phones, etc.; https://digitalcorpora.org/corpora/disk-images/real-data-corpus has more.
(Why work with files TSK can recover? Because then you have ground truth for evaluation of your carving algorithms.)
As of February 21, 2011, the Non-US Person's Corpus consists of the following:
- 1,289 hard drive images ranging in size from 500MB to 80GB
- 643 flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB
- 98 CD-ROMs
For a total of 70TB of data (uncompressed).
Forensically interesting things:
Many drives had no fragmentation! About 10% of drives had more than 10% of files fragmented. But: files of interest tend to be more likely to be fragmented.
What causes fragmentation? Modern OSes mostly attempt to avoid it, but sometimes cannot:
- Not enough contiguous free space (if drive is old, if drive is near capacity, and if lots of files have been added/deleted over time).
- If data is appended rather than a new file being created.
- The FS may not support writing certain file sizes contiguously; UFS will fragment very large files, and will fragment files that don't occupy an even number of sectors. Other filesystems have other related restrictions.
Table 3 shows fragmentation by file type. Notably files of interest to forensic examiners (log files, avi, doc, jpeg, etc.) are more likely to be fragmented.
Bifragmented files
What about files that are split into just two fragments? Table 4. Most have a gap of some power of 2 times the sector size, indicating they are split across some number of intervening sectors. As you may know, FAT/NTFS allocate sectors in groups called “clusters” that are always a power of two sectors long, so this implies a file was fragmented due to a single cluster being “in the way” of a contiguous write.
Tables 5 and 6 show the same distributions for bifragmented JPEG and HTML files in particular.
Table 8 shows gap sizes for fragmented blocks tend to cluster around power-of-two sizes. (Q: What’s going on with 0 gaps in this and other tables?)
“Highly fragmented” files
Some files (typically large, DLLs and CABs) are highly fragmented. Likely due to download/install on a drive that’s already full o’ stuff. Interesting observation: you could exclude these files if you knew what they looked like in advance (forensically uninteresting in most cases). This insight will come up again in later papers.
Object Validation
Carving requires being able to recognize files of interest, e.g., from headers and footers or some other way. Garfinkel calls it “object validation” because it’s not strictly files that we are interested in.
Fast object validation
Validation is a decision problem: Is this string of bytes a valid object, or not?
If you could do this fast, then you could do it for every substring in the image. How many substrings? n(n+1)/2.
200 GB drive? roughly 2 x 10^22 strings.
You can improve this in various ways. Sector boundaries only? 511/512 can be discarded (actually, more like 4095/4096 these days, or potentially higher in a cluster-based file system like FAT or NTFS!)
Then, you can find the “end” of your substring by binary searching (log n) for it.
200GB HD goes from 1.9 x 10^22 to about 4 x 10^8 objects to check, ~40 validations each.
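A quick back-of-envelope check of these numbers (a sketch; the exact figures depend on how you count):

```python
from math import log2

n = 200 * 10**9                     # bytes on a 200GB drive
substrings = n * (n + 1) // 2       # every (start, end) pair
print(f"{substrings:.1e}")          # ~2.0e+22 candidate strings

sector = 512
starts = n // sector                # sector-aligned starting points: ~3.9e8
probes = round(log2(starts))        # ~29 binary-search probes to locate the end,
print(starts, probes)               # the same few-dozen validations per object cited above
```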
JPEGs, for example, have a distinctive marker at the start, and so you can use this technique to find them (well, contiguous ones) at disk speed.
Headers / footers
If you have a header/footer marker, you can use it to reject some false positives. But it will keep some (for example, if sectors are missing from, or extra sectors inserted into, the middle of a file).
Container structures
Some files contain additional structures of varying complexity. Binary fields that contain offsets to other fields; fields that contain constants, or that contain constrained values, or that indicate file length (max or min), and so on.
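As one concrete example of a container-structure check (illustrative, not from the paper), here’s a sketch that sanity-checks a ZIP local file header: a magic constant plus fixed-offset length fields that have to be internally consistent with the carved candidate.

```python
import struct

def plausible_zip_local_header(buf: bytes) -> bool:
    """Cheap structural check: does `buf` start with a plausible ZIP local file header?"""
    if len(buf) < 30 or buf[:4] != b"PK\x03\x04":
        return False                              # wrong or missing magic constant
    # Fixed-layout fields at known offsets: compressed size, uncompressed size,
    # filename length, extra-field length.
    comp_size, _uncomp_size, name_len, extra_len = struct.unpack_from("<IIHH", buf, 18)
    if name_len == 0 or name_len > 512:           # entries normally carry a sane name
        return False
    # Header + name + extra field + compressed data should fit inside the candidate.
    return 30 + name_len + extra_len + comp_size <= len(buf)
```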
Decompression
You can again use file-specific details to check if a carved fragment is valid. For example, the “body” of a JPEG is Huffman-coded data (more next class!) that represents the picture. You can attempt to decode it, or you can just check whether the Huffman symbols are valid; if not, the object is invalid.
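A hedged sketch of “validate by attempting to decode”, using Pillow as a stand-in for the paper’s Huffman-symbol check (coarser, but the same idea: a failed decode means an invalid candidate):

```python
from io import BytesIO

from PIL import Image  # Pillow, standing in for a purpose-built Huffman validator

def validates_as_jpeg(candidate: bytes) -> bool:
    """Return True if the candidate bytes decode as a JPEG without raising."""
    try:
        with Image.open(BytesIO(candidate)) as img:
            img.load()              # force a full decode, not just header parsing
        return True
    except Exception:
        return False
```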
Similarly, in Word files, you can check what are supposed to be the text sections – if they contain invalid characters, it’s likely to be an invalid Word file.
Practice report: JPEG decoders are tolerant of errors! But usually not entirely. Garfinkel found that “extra” data was never used in image reconstruction.
Semantic validation
“hospitals” example; choice of language; manual tuning corpora; etc.
Manual validation
A human should look at things. Duh.
Validation framework
This is a description of the tool that Garfinkel designed. Not super relevant to us.
Carving w/ Validation
Contiguous algorithms
- Header/footer carving
- Header / maximum size carving (which you can use when the format doesn’t care about extra data appended); binary search on the end (see the sketch after this list)
- Header / embedded length: grow one sector at a time.
Then:
- Automatic trimming: either well-defined footer, or byte at a time until file no longer validates
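Here’s a sketch of header/maximum-size carving with the binary search on the end, referenced in the list above. It assumes a `validates` predicate that is monotone for the format in question: once the candidate contains the whole object, appended extra data never makes it invalid. (A real carver would work at sector granularity; this uses byte offsets for simplicity.)

```python
def carve_header_max_size(image: bytes, start: int, validates, max_size: int):
    """Sketch: binary-search for the smallest end offset whose prefix validates."""
    hi = min(start + max_size, len(image))
    if not validates(image[start:hi]):
        return None                        # no valid object of this type starts here
    lo = start + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if validates(image[start:mid]):
            hi = mid                       # a shorter prefix still validates
        else:
            lo = mid + 1                   # need more data
    return image[start:hi]
```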
Fragment recovery carving
Bifragment gap carving: basically brute force.
Let f1 be the first fragment that extends from sectors s1 to e1 and f2 be the second fragment that extends from sectors s2 to e2.
Let g be the size of the gap between the two fragments, that is, g = s2 - (e1 + 1).
Starting with g=1, try all gap sizes until g=e2-s1.
For every g, try all consistent values of e1 and s2.
Essentially, this algorithm places a gap between the start and the end flags, concatenating the sector runs on either side of the gap, and growing the gap until a validating sequence is found. This algorithm is O(n^2) for carving a single object for file formats that have recognizable header and footer; it is O(n^4) for finding all bifragmented objects of a particular type in a target, since every sector must be examined to determine if it is a header or not, and since any header might be paired with any footer.
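A sketch of bifragment gap carving over sectors, following the description above (again assuming some `validates` predicate; the sector size and return convention are illustrative, not the paper’s):

```python
SECTOR = 512

def bifragment_gap_carve(image: bytes, s1: int, e2: int, validates):
    """
    Brute-force bifragment gap carving (sketch).
    s1: sector containing the header; e2: sector containing the footer.
    For every gap size g and split point e1, glue sectors [s1..e1] + [s2..e2]
    with s2 = e1 + 1 + g, and stop when the concatenation validates.
    """
    for g in range(1, e2 - s1):                       # gap sizes, in sectors
        for e1 in range(s1, e2 - g):                  # end of the first fragment
            s2 = e1 + 1 + g                           # start of the second fragment
            candidate = (image[s1 * SECTOR:(e1 + 1) * SECTOR] +
                         image[s2 * SECTOR:(e2 + 1) * SECTOR])
            if validates(candidate):
                return candidate, (s1, e1, s2, e2)
    return None
```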