COMPSCI 590K: Advanced Digital Forensics Systems | Spring 2020

03: Carving

Today we’re covering “Carving Contiguous and Fragmented Files with Object Validation”, another paper by Simson Garfinkel. It’s a pretty easy read and thus good for the early part of the semester. Later papers might not be so clear for various reasons.

Anyway: the paper is about file carving. Carving, as you’ll recall, is finding arbitrary files embedded in other files, without the use of filesystem metadata. This comes up when filesystem metadata is deleted, damaged, or otherwise unavailable, or when you’re looking for files that might be embedded in other files.

Note that this means the carver is totally independent of filesystem metadata! Doing things like changing extensions, etc., will work against tools that look at the filesystem metadata (as some tools, particularly triage tools, do), but matter not at all to carvers, which look only at the raw bytes stored on a disk.

At a very high level, the way carvers work is by linearly scanning a file (representing a disk image, or that may contain embedded files, etc.), looking for markers or delimiters that indicate the start of files of interest. The carver then uses some algorithm to decide how much of the file to select, starting from the marker and running until some condition is met. This process is repeated to extract all files of interest.

Example carving

Suppose we had some text embedded in a binary file. (Demo creation using /dev/urandom and a hex editor). How do we “carve”, that is, extract out the text from the binary data?

Here’s a simple program to do so in Python (I assume Python is mostly a common language for you all; please ask if you need help understanding this code):

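import sys

def is_char(b):
    # Printable ASCII: space (0x20) through tilde (0x7E).
    return 0x20 <= b <= 0x7E

def strings(filename):
    with open(filename, "rb") as f:
        data = f.read()
    in_string = False
    current_chars = []
    for b in data:
        if not in_string:
            if is_char(b):
                # First printable byte starts a new candidate string.
                in_string = True
                current_chars.append(b)
        else:  # in_string
            if is_char(b):
                current_chars.append(b)
            else:
                # A non-printable byte ends the run; keep runs of 4+ characters.
                if len(current_chars) > 3:
                    print("".join((chr(c) for c in current_chars)))
                current_chars = []
                in_string = False
    # Flush a run that extends to the end of the file.
    if in_string:
        if len(current_chars) > 3:
            print("".join((chr(c) for c in current_chars)))

def main():
    strings(sys.argv[1])

if __name__ == "__main__":
    main()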
So that just looks for strings of contiguous characters; you might also want to extract a particular filetype. For example, if you want to carve HTML, you’d do something similar, but require the first several characters be (case-insensitive) <HTML> and the last </HTML>. You could imagine various ways to optimize this, too, like not building up the array of characters at all until it starts with <HTML>. I expect you could do this yourself if asked, though, so I’m not going to spend class time on it.
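For reference, here is a minimal sketch of the HTML case. It swaps the byte-at-a-time state machine for a regular expression, which is a different technique than the program above, and the names HTML_PATTERN and carve_html are made up for illustration. It also accepts attributes on the opening tag, which is an assumption beyond the literal <HTML> marker.

```python
import re

# Case-insensitive header/footer markers; DOTALL lets "." span newlines.
# The non-greedy ".*?" stops at the first footer after each header.
HTML_PATTERN = re.compile(rb"<html\b.*?</html>", re.IGNORECASE | re.DOTALL)

def carve_html(data):
    """Return (offset, bytes) for each <html>...</html> run found in data."""
    return [(m.start(), m.group(0)) for m in HTML_PATTERN.finditer(data)]
```

Like the strings program, this only finds contiguous runs: a page split across non-adjacent sectors would come out wrong or not at all.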


There are two high-level complications to consider, which I’ll cover in brief now and return to later on.

The simple form of carving described above handles only contiguous files. In other words, only byte ranges that are fully contiguous within the image file can be extracted. This means non-contiguous files will at best be split into parts; they might also be truncated, or missed entirely, depending upon the details of the carving algorithm.

So that’s an error! Which is the other complication. Carvers make mistakes. They can fail to find (or fail to find all of) a file that actually exists – these are kinds of false negatives. They can also find files that don’t actually exist, that is, the delimiter that was found might be part of unrelated data, and not actually indicate a file of interest. These are false positives.

One insight of this paper: Rapidly generate as many positives (false and true) as possible in a linear scan; then use various fast tests to winnow the list down, using increasingly expensive (time/space) tests as the winnowing continues.
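A rough sketch of that winnowing idea follows. The function names and the three checks are hypothetical stand-ins, not the paper's validators; the point is only the ordering, where cheap tests run first so that expensive ones see fewer candidates.

```python
def winnow(candidates, validators):
    """Run validators in order (cheapest first), keeping only survivors."""
    survivors = list(candidates)
    for validate in validators:
        survivors = [c for c in survivors if validate(c)]
    return survivors


# Hypothetical checks for JPEG-like candidates, ordered from cheap to costly.
def has_header(buf):
    return buf[:3] == b"\xff\xd8\xff"

def has_footer(buf):
    return buf[-2:] == b"\xff\xd9"

def no_long_zero_run(buf):
    # Stand-in for an expensive structural check: scans the whole candidate.
    return b"\x00" * 512 not in buf
```

Calling winnow(candidates, [has_header, has_footer, no_long_zero_run]) touches only a few bytes of most candidates, because the full scan runs only on those that passed both cheap tests.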

Paper contributions

  • Survey of FS fragmentation from disks in the wild. How often are non-contiguous files an issue in practice?
  • Describes various carving techniques and uses them to define several carving algorithms.
  • Applies these algorithms to 2006 DFRWS Forensics Challenge.

Fragmentation in the wild

Garfinkel collected used HDs (ranging in size up to 20GB – remember this paper was published in 2007) by purchasing them on eBay over an 8-year period. About 1/3 were sanitized, but 2/3 still contained user data.

sanitized – reformatted, including an overwrite-with-pattern (typically 0s)

At the time, this corpus contained ~324 drives with useful data; TSK found 2.2M files with filenames, about 2.1M with associated data, 892 GB recoverable. The corpus is larger now: it contains many real and synthetic images from HDs, flash drives, phones, etc.; has more.

(Why work with files TSK can recover? Because then you have ground truth for evaluation of your carving algorithms.)

As of February 21, 2011, the Non-US Person’s Corpus consists of the following: 1,289 hard drive images ranging in size from 500MB to 80GB; 643 flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB; and 98 CD-ROMs, for a total of 70TB of data (uncompressed).

Forensically interesting things:

Many drives had no fragmentation! About 10% of drives had more than 10% of files fragmented. But: files of interest tend to be more likely to be fragmented.

What causes fragmentation? Modern OSes mostly attempt to avoid it, but sometimes cannot:

  1. Not enough contiguous free space (if drive is old, if drive is near capacity, and if lots of files have been added/deleted over time).
  2. If data is appended rather than a new file being created.
  3. FS may not support writing certain file sizes contiguously; UFS will fragment very large files, and fragment files that don’t occupy an even number of sectors. Other FS have other related restrictions.

Table 3 shows fragmentation by file type. Notably files of interest to forensic examiners (log files, avi, doc, jpeg, etc.) are more likely to be fragmented.

SSDs basically ignore one bad effect of fragmentation, that is, there is no seek time associated with skipping over unneeded blocks. But the files are still fragmented on the disk, so for a carver, the same issues still arise.

Bifragmented files

What about files that are split into just two fragments? Table 4. Most have a gap that is some power-of-2 multiple of the sector size, indicating they are split across some number of intervening sectors. As you may know, FAT/NTFS allocate sectors in groups called “clusters” that are always a power of two sectors long, so this implies a file was fragmented due to a single cluster being “in the way” of a contiguous write.

Tables 5 and 6 show the same distributions for JPEGs and HTML bifragmented files in particular.

Table 8 shows gap sizes for fragmented blocks tend to cluster around power-of-two sizes. (Q: What’s going on with 0 gaps in this and other tables?)

“Highly fragmented” files

Some files (typically large, DLLs and CABs) are highly fragmented. Likely due to DL/install on a drive that’s already full o’stuff. Interesting observation: you could exclude these files if you knew what they looked like in advance (forensically uninteresting in most cases). This insight will come up again in later papers.

Object Validation

Carving requires being able to recognize files of interest, e.g., from headers and footers or some other way. Garfinkel calls it “object validation” because it’s not strictly files that we are interested in.

Fast object validation

Validation is a decision problem: Is this string of bytes a valid object, or not?

If you could do this fast, then you could do it for every substring in the image. How many substrings? n(n+1)/2.

200 GB drive? roughly 2 x 10^22 strings.

You can improve this in various ways. Sector boundaries only? 511/512 can be discarded (actually, more like 4095/4096 these days, or potentially higher in a cluster-based file system like FAT or NTFS!)

Then, you can find the “end” of your substring by binary searching (log n) for it.

200GB HD goes from 1.9 x 10^22 to about 4 x 10^8 objects to check, ~40 validations each.
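To make the arithmetic concrete, here is a small sketch. It assumes decimal gigabytes and 512-byte sectors, and the function names are invented for illustration. The exact figures shift a little with how you define a gigabyte, but they land in the same neighborhood as the numbers quoted above.

```python
import math

DRIVE_BYTES = 200 * 10**9   # assumption: decimal gigabytes
SECTOR_BYTES = 512          # assumption: classic 512-byte sectors

def substring_count(n):
    """Number of non-empty contiguous substrings of an n-byte image."""
    return n * (n + 1) // 2

def sector_aligned_starts(n, sector=SECTOR_BYTES):
    """Candidate start offsets if objects may only begin on sector boundaries."""
    return n // sector

def binary_search_probes(n):
    """Upper bound on validations needed to binary-search for an end offset."""
    return math.ceil(math.log2(n))

print(f"{substring_count(DRIVE_BYTES):.2e} substrings")
print(f"{sector_aligned_starts(DRIVE_BYTES):.2e} sector-aligned starts")
print(binary_search_probes(DRIVE_BYTES), "probes per start")
```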

JPEGs, for example, have a distinctive marker at the start, and so you can use this technique to find them (well, contiguous ones anyway) at disk speed.
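A sketch of that scan, checking only the first bytes of each sector. The 512-byte sector size and the name find_jpeg_headers are assumptions for illustration; the three-byte header is the JPEG start-of-image marker plus the first byte of the marker that follows it.

```python
SECTOR_BYTES = 512
JPEG_HEADER = b"\xff\xd8\xff"   # SOI marker, then the start of the next marker

def find_jpeg_headers(image, sector=SECTOR_BYTES):
    """Return the byte offsets of sectors that begin with a JPEG header."""
    return [
        offset
        for offset in range(0, len(image), sector)
        if image[offset:offset + len(JPEG_HEADER)] == JPEG_HEADER
    ]
```

Each hit is only a candidate start: some will be false positives, and the scan says nothing yet about where the file ends.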

A question from a student:

How early was this method utilized and how ground break would this be considered. IS there a RANGE of the types of efficiencies that this carving could bring. When I looked online it said “File extensions can be faked – that file with an .mp3 extension may actually be an executable program. Hackers can fake file extensions by abusing a special Unicode character, forcing text to be displayed in reverse order.” If this is true then what is the impact on comparing the particular components of the files in question. This paper mentions that 99.8% of the files can be skipped since it might be faking its potential structure. Would love an opinion on this


What exactly happens if I change file.mp3 to (or something else)?

You need to think about how filesystems and file formats are distinct things. Modifying the former has no effect upon the latter.

Headers / footers

If you have a header/footer marker, you can use it to reject some false positives. But it will keep some (for example, if there are missing or extra sectors read into the middle of files).

JPEGs have fixed starting and ending byte sequences (as do many binary formats); but note that the “occurring randomly” probability is a bit misleading here. Only true if the underlying data is drawn uniformly at random from all possible bytes, which is approximately true only in very restricted cases. But that actually works in our favor, as structured data will be biased in ways we can recognize (sometimes).

Container structures

Some files contain additional structures of varying complexity. Binary fields that contain offsets to other fields; fields that contain constants, or that contain constrained values, or that indicate file length (max or min), and so on.
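As one illustration of checking container structure, JPEG headers are a chain of segments, each a 0xFF marker byte, a marker code, and a two-byte big-endian length that counts itself. The sketch below walks that chain up to the Start of Scan marker. It is deliberately simplified (it ignores fill bytes and length-less markers that a real decoder tolerates), and the function name is hypothetical.

```python
import struct

def jpeg_segments_consistent(buf):
    """Walk JPEG marker segments up to Start of Scan; False on any inconsistency."""
    if buf[:2] != b"\xff\xd8":           # must open with the SOI marker
        return False
    pos = 2
    while pos + 4 <= len(buf):
        if buf[pos] != 0xFF:             # every segment starts with 0xFF
            return False
        marker = buf[pos + 1]
        (length,) = struct.unpack(">H", buf[pos + 2:pos + 4])
        if length < 2 or pos + 2 + length > len(buf):
            return False                 # length field points outside the buffer
        if marker == 0xDA:               # SOS: entropy-coded data follows
            return True
        pos += 2 + length                # jump to the next segment
    return False                         # ran out of data before reaching SOS
```

A candidate with a stray sector spliced into its header will usually fail here, because the next "segment" no longer begins with 0xFF.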


You can again use file-specific details to check if a carved fragment is valid. For example, the “body” of a JPEG is Huffman-coded (more next class!) data that represents the picture. You can attempt to decode it, or you can just check whether the Huffman symbols are correct; if not, it’s invalid. (How? By checking against the set of assigned symbols described in the header.)

Similarly, in Word files, you can check things like what is supposed to be the text sections – if it contains invalid characters, it’s likely to be an invalid Word file.

Practice report: JPEG decoders are tolerant to errors! But usually not entirely. Garfinkel found that “extra” data was never used in image reconstruction.

Semantic validation

“hospitals” example; choice of language; manual tuning corpora; etc. Some of this might be automated (NLP or simpler approaches for text).

Manual validation

A human should look at things. Duh.

Validation framework

This is a description of the tool that Garfinkel designed. Not super relevant to us.

Carving w/ Validation

Contiguous algorithms

  • Header/footer carving
  • Header / maximum size carving (which you can use when the format doesn’t care about extra data appended); binary search on end
  • Header / embedded length: grow one sector at a time. (Not sure why recursive doubling / binary search isn’t used; maybe not necessary for most smaller files?)
  • Automatic trimming: either well-defined footer, or byte at a time until file no longer validates
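Here is one way the header / maximum size approach with a binary search on the end might look. This is my reading of the idea, not the paper's code: carve_max_size and validates are hypothetical names, and the search is only correct under the stated assumption that adding trailing sectors never turns a valid object invalid.

```python
def carve_max_size(image, start, max_sectors, validates, sector=512):
    """Find the shortest sector-aligned run from start that validates.

    Assumes validates(buf) is monotone: once a run is long enough to validate,
    appending more sectors keeps it valid (the format ignores trailing data).
    """
    available = (len(image) - start) // sector
    hi = min(max_sectors, available)
    if hi < 1 or not validates(image[start:start + hi * sector]):
        return None                       # even the maximum size fails
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if validates(image[start:start + mid * sector]):
            hi = mid                      # valid: the end is at or before mid
        else:
            lo = mid + 1                  # invalid: need more sectors
    return image[start:start + lo * sector]
```

The result is still sector-aligned, so automatic trimming would be the follow-up step if you want the exact last byte.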

Fragment recovery carving

Bifragment gap carving: basically brute force.

Let f1 be the first fragment that extends from sectors s1 to e1 and f2 be the second fragment that extends from sectors s2 to e2.

Let g be the size of the gap between the two fragments, that is, g = s2 - (e1 + 1).

Starting with g=1, try all gap sizes until g=e2-s1.

For every g, try all consistent values of e1 and s2.

Essentially, this algorithm places a gap between the start and the end flags, concatenating the sector runs on either side of the gap, and growing the gap until a validating sequence is found. This algorithm is O(n^2) for carving a single object for file formats that have recognizable header and footer; it is O(n^4) for finding all bifragmented objects of a particular type in a target, since every sector must be examined to determine if it is a header or not, and since any header might be paired with any (subsequent) footer.
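A minimal sketch of the single-object case, using the same names as the definitions above (s1, e1, s2, e2, g). The function name and the validates callback are assumptions for illustration, and sector indices are inclusive.

```python
def bifragment_gap_carve(image, s1, e2, validates, sector=512):
    """Try every gap between header sector s1 and footer sector e2 (inclusive)."""
    def sectors(first, last):
        return image[first * sector:(last + 1) * sector]

    for g in range(1, e2 - s1 + 1):          # gap size in sectors
        for e1 in range(s1, e2 - g):         # last sector of the first fragment
            s2 = e1 + 1 + g                  # first sector of the second fragment
            candidate = sectors(s1, e1) + sectors(s2, e2)
            if validates(candidate):
                return e1, s2, candidate
    return None
```

The two nested loops over sector positions are where the O(n^2) comes from; repeating this for every header and footer pair in the image gives the larger bound.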