COMPSCI 590K: Advanced Digital Forensics Systems | Spring 2020

06: Small block hashing


There is a growing need for automated techniques and tools that operate on bulk data, and specifically on bulk data at the block level:

  • File systems and files may not be recoverable due to damage, media failure, partial overwriting, or the use of an unknown file system.
  • There may be insufficient time to read the entire file system, or a need to process data in parallel.
  • File contents may be encrypted.
  • The tree structure of file systems makes it hard to parallelize many types of forensic operations.

When individual files cannot be recovered, small block techniques can be used to analyze file fragments.

When there is insufficient time to analyze the entire disk, small block techniques can analyze a statistically significant sample.

These techniques also allow a single disk image to be split into multiple pieces and processed in parallel.

Finally, block-level techniques can be used to track the movement of encrypted files within an organization, even when the files themselves cannot be decrypted, because every block of every well-encrypted file should be distinct.

Two things in this paper: block hash calculations and bulk data analysis.



headers/footers + uni-bi-tri grams

problems with statistical analysis (container formats)

the MD5 trick aka hash-based data carving

compressed data (note about patents: they encumber products, not research!)

mp3scalpel (recognize adjacent sectors using frame headers)

Distinct block recognition

Hashes can be used to analyze the flow / possession of files, so long as several properties hold.

One is intrinsic to the hash function: collision resistance – how likely is it that two different files have the same hash?

(Though note: collisions are a performance issue, not a correctness issue. In the forensic context you would always then go verify the actual bits, not just the hashes, as matching.)

Another is related to the notion of files and “criminal networks”. If you want to infer that seeing a hash of a file that’s known to be on such a network means the computer (and thus person) were in contact with that network, then you need to (a) see the hash and (b) know that the hash/file isn’t present elsewhere.

What about if we look at blocks, rather than whole files? E.g., if we see a block that’s identical (by hash) across two drives, does it imply the associated file was copied from one drive to another? Maybe. Is the block rare/distinct, or not?


What does it mean for a block to be distinct? (Analogy to natural language.)

How does this matter for us? Some things are not very distinct (blocks with same values in all bytes, only 256 such blocks). Some are very likely to be distinct – high entropy, like, say, the middle of a Huffman-coded JPEG. Exceptions? What if it’s a very popular JPEG that lots of people have downloaded? Entropy alone is not a perfect measure (12 chars embedded randomly in an otherwise NUL block vs at the start, etc.).
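A minimal sketch of the two ideas above, assuming 4 KiB blocks and byte-level Shannon entropy as the measure (the function names are made up for illustration). It shows why entropy alone is not enough: the same 12 characters give the same entropy wherever they sit in the block, because the byte histogram ignores position.

```python
import math
import os
from collections import Counter

def shannon_entropy(block: bytes) -> float:
    """Entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    n = len(block)
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((c / n) * math.log2(n / c) for c in Counter(block).values())

def is_constant(block: bytes) -> bool:
    """True when every byte in the block has the same value."""
    return len(set(block)) <= 1

BLOCK = 4096  # assumed block size
nul_block = bytes(BLOCK)
text_at_start = b"hello, world" + bytes(BLOCK - 12)
text_in_middle = bytes(2000) + b"hello, world" + bytes(BLOCK - 2012)
random_block = os.urandom(BLOCK)

for name, block in [("NUL", nul_block), ("text at start", text_at_start),
                    ("text in middle", text_in_middle), ("random", random_block)]:
    print(f"{name:15s} constant={is_constant(block)} "
          f"entropy={shannon_entropy(block):.3f}")
```

The two "text" blocks report identical (low) entropy even though they hash differently, so a distinctness test needs more than an entropy threshold.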

Defining “Distinct”

Distinct Block (definition): A block of data that will not arise by chance more than once.

JPEGs produced by a camera of a lit visual scene. (“Cannot step into the same sunny day more than once.”)

Distinct blocks can be a powerful forensic tool if we assume these two hypotheses:

Distinct Block Hypothesis #1: If a block of data from a file is distinct, then a copy of that block found on a data storage device is evidence that the file was once present.

Distinct Block Hypothesis #2: If a file is known to have been manufactured using some high-entropy process, and if the blocks of that file are shown to be distinct throughout a large and representative corpus, then those blocks can be treated as if they are distinct.

Block, sector, file alignment

Most file systems align on sector boundaries. If block size == sector size, then they will exactly correspond on most file systems. This is true even if the file is fragmented, so long as the file system sector-aligns blocks. (Not all FSs do this for every file; for example, very small files are not sector-aligned in NTFS; some Unix FSs do “tail packing”, and so on. But largely true.)

Choosing a block size

Paper uses 4K. Why not standard sector size (512B)?

  • less data
  • empirically more accurate
  • most files > 4K
  • “newer” (paper is from 2010) drives are starting to use 4K sectors

What about fragmentation?

Block sector alignment

Sliding window:

S0S1S2S3S4S5S6S7 => B0 
S1S2S3S4S5S6S7S8 => B1
S7S8S9S10S11S12S13S14 => B7

This applies to sectors on disk. It is not clear whether it is also applied to files (at a first read, no: the justification is that sector alignment on disk might not match block alignment, whereas for individual files you know where they start and end).
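The sliding window can be sketched as follows, assuming 512-byte sectors, 8 sectors per 4 KiB block, and SHA-1 as the block hash (the function name is hypothetical, and a real tool would stream from the image rather than hold it in memory):

```python
import hashlib

SECTOR_SIZE = 512
SECTORS_PER_BLOCK = 8  # 8 x 512 B = 4 KiB

def sliding_block_hashes(image: bytes):
    """Yield (starting_sector, sha1_hex) for each window of 8 consecutive sectors."""
    block_size = SECTOR_SIZE * SECTORS_PER_BLOCK
    n_sectors = len(image) // SECTOR_SIZE
    # Window i covers sectors S(i) .. S(i+7), i.e. block B(i).
    for start in range(n_sectors - SECTORS_PER_BLOCK + 1):
        offset = start * SECTOR_SIZE
        yield start, hashlib.sha1(image[offset:offset + block_size]).hexdigest()

# Toy image: 16 sectors, each filled with a different byte value.
image = b"".join(bytes([i]) * SECTOR_SIZE for i in range(16))
hashes = list(sliding_block_hashes(image))
print(len(hashes), "blocks from", len(image) // SECTOR_SIZE, "sectors")
```

Because the window advances one sector at a time, a 4 KiB block that starts at any sector offset on disk is hashed at some step, at the cost of hashing each sector eight times.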

Experimental results

nps-2009-domexusers (Windows XP system with two users, who communicate with a third user via email and IM)

Examined block sizes from 512B to 16K; started by removing constant blocks (many NUL blocks, also many with FF and other values). NUL corresponds to freshly-formatted drive sectors that have never been written to.

Amount of remaining data was roughly constant regardless of block size (implies that the feature size of the file system’s allocation strategy is larger than 16K).

Then they hashed the remaining results and stored them in a Bloom filter.

Bloom what now?

A useful data structure to know about, and there are many variants/improvements/etc.

Bloom filters are a space-efficient way to probabilistically determine set membership of elements. They can have false positives (that is, given an element, mistakenly indicate it is in the set when it is not), but not false negatives (they’ll never mistakenly indicate an element is not in the set when it is.)

A Bloom filter is a bit array of m bits. It also requires k independent hash functions, each of which maps an input to one of the m bit positions. To add an element to the filter, we hash it k times and set each of the k corresponding bits to 1. To check for membership, we hash an element k times and see whether all k of those bits are 1. The choice of m and k depends on your space and false-positive-rate requirements.
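A minimal sketch of the data structure just described. One assumption to flag: rather than k truly independent hash functions, it derives the k positions from a single SHA-256 digest by double hashing, a common approximation. The class and method names are made up for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m                      # number of bits in the array
        self.k = k                      # number of hash positions per item
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: bytes):
        # Assumption: double hashing (h1 + i*h2) stands in for k independent hashes.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter(m=2**16, k=4)
for word in [b"alpha", b"beta", b"gamma"]:
    bf.add(word)
print(b"alpha" in bf, b"delta" in bf)
```

With n items inserted, the false-positive rate is approximately (1 - e^(-kn/m))^k, which is the trade-off behind the choice of m and k.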

Back to nps-2009-domexusers

Sector duplication can result from duplication within files (repeated regions), or from multiple copies of a file on a drive. As Table 1 shows, approximately half of the non-constant sectors on nps-2009-domexusers are distinct.

The fraction of distinct sectors increases with larger block sizes. One possible explanation is that the duplicate sectors are from multiple copies of the same file. Recall that most files are stored contiguously. With small sampling block sizes there is a good chance that individual files will align with the beginning of a block sample. But as the sampling size increases, there is an increased chance that the beginning of a file will not align with a sampling block. If two files align differently, then the block hashes for the two files will be different.

In our data set we identified 558,503,127 (87%) blocks that were distinct and 83,528,261 (13%) that appeared in multiple locations. By far the most common was the SHA-1 for the block of all NULs, which appeared 239,374 times. However, many of these duplicates are patterns that repeat within a single file but which are not present elsewhere within the NSRL, allowing the hashes to be used to recognize a file from a recognized fragment.



Carving. Unlike traditional carving, hash-based carving searches for files that are already known in a master corpus.

Block_size = sector size. Larger block sizes are more efficient, but larger blocks complicate the algorithm because of data alignment and partial write issues.


  1. For each master file a filemap data structure is created that can map each master file sector to a set of sectors in the image file. A separate filemap is created for each master.

  2. Every sector of each master file is scanned. For each sector the MD5 and SHA-1 hashes are computed. The MD5 code is used to set a corresponding bit in a 2^24-bit Bloom filter. The SHA-1 codes are stored in a data structure called the shamap that maps SHA-1 codes to one or more sectors in which that hash code is found.

  3. Each sector of the image file is scanned. For each sector the MD5 hash is computed and the corresponding bit checked in the Bloom filter. This operation can be done at nearly disk speed. Only when a sector’s MD5 is found in the Bloom filter is the sector’s SHA-1 calculated. The shamap structure is consulted; for each matching sector found in the shamap, the sector number of the image file is added to the corresponding sector in each master filemap.

  4. Each filemap is scanned for the longest runs of consecutive image file sectors. This run is noted and then removed from the filemap. The process is repeated until the filemap contains no more image file sectors.
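The four steps above can be sketched as below. This is an illustration under stated assumptions, not the frag_find implementation: a Python set stands in for the Bloom filter, everything is held in memory, and the function names are made up.

```python
import hashlib
from collections import defaultdict

SECTOR = 512

def sectors(data: bytes):
    """Yield (sector_number, sector_bytes) for each whole sector."""
    for n in range(len(data) // SECTOR):
        yield n, data[n * SECTOR:(n + 1) * SECTOR]

def build_index(masters: dict):
    """Steps 1-2: MD5 prefilter plus shamap: sha1 -> [(master, master_sector)]."""
    md5_prefilter = set()          # stand-in for the Bloom filter
    shamap = defaultdict(list)
    for name, data in masters.items():
        for n, sec in sectors(data):
            md5_prefilter.add(hashlib.md5(sec).digest())
            shamap[hashlib.sha1(sec).digest()].append((name, n))
    return md5_prefilter, shamap

def scan_image(image: bytes, md5_prefilter, shamap):
    """Step 3: filemaps[master][master_sector] = set of matching image sectors."""
    filemaps = defaultdict(lambda: defaultdict(set))
    for img_n, sec in sectors(image):
        if hashlib.md5(sec).digest() not in md5_prefilter:
            continue               # cheap rejection; SHA-1 only on a prefilter hit
        for name, master_n in shamap.get(hashlib.sha1(sec).digest(), ()):
            filemaps[name][master_n].add(img_n)
    return filemaps

def longest_runs(filemap):
    """Step 4: repeatedly note and remove the longest run of consecutive sectors."""
    remaining = {m: set(s) for m, s in filemap.items() if s}
    runs = []                      # (master_start, image_start, length)
    while remaining:
        best = (0, None, None)
        for m in sorted(remaining):
            for s in sorted(remaining[m]):
                length = 0
                while s + length in remaining.get(m + length, ()):
                    length += 1
                if length > best[0]:
                    best = (length, m, s)
        length, m, s = best
        runs.append((m, s, length))
        for i in range(length):
            remaining[m + i].discard(s + i)
            if not remaining[m + i]:
                del remaining[m + i]
    return runs

# Toy data: a 6-sector master stored in two fragments inside a mostly-NUL image.
master = b"".join(hashlib.sha256(bytes([i])).digest() * 16 for i in range(6))
image = bytes(3 * SECTOR) + master[:4 * SECTOR] + bytes(2 * SECTOR) + master[4 * SECTOR:]
prefilter, shamap = build_index({"master.bin": master})
filemaps = scan_image(image, prefilter, shamap)
print(longest_runs(filemaps["master.bin"]))
```

The run search here is deliberately naive (it rescans the whole filemap for every run); it is only meant to make step 4 concrete.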

Why two hashes? MD5 is much faster than SHA, though SHA is more robust (for legal purposes). I don’t actually understand this argument, since once you carve the file you can do a bit-for-bit examination to see if it was a hash collision (FP) or actually the master file.


Often when performing file carving it is useful to remove from the disk image all of the allocated sectors and to carve the unallocated space. (Why? Because you can recover allocated data using traditional tools pretty easily.)

Our experience with frag_find showed us that fragments of a master file are often present in multiple locations on a disk image. This led us to the conclusion that it might be useful to remove from the disk image not merely the allocated files, but all of the distinct sectors from the image’s allocated files.

After some experimentation we created a tool called precarve that performs a modified version of this removal. We found that removing distinct blocks was not sufficient, as there were blocks that were shared in multiple files which could not be safely removed. We re-designed the tool so that it would remove any sequence of sectors from the unallocated region that matched more than r allocated sectors. After trial-and-error we found that r = 4 provided the best performance when carving JPEGs.

Statistical sector sampling to detect the presence of contraband data

Sector-based hashing can be combined with statistical sampling to provide rapid identification of residual data from large files. This might be especially useful at a checkpoint, where the presence of a specific file might be used as the basis to justify a more thorough search or even arrest.

Consider a 100 MB video for which the hash of each 512-byte block is distinct. A 1 TB drive contains approximately 2 billion 512-byte sectors. If one 512-byte sector is sampled at random, the chance that the data will be missed is overwhelming: (2,000,000,000 - 200,000) / 2,000,000,000 = 0.9999.

Readers versed in statistics will note that we have described the well-known “Urn Problem” of sampling without replacement.

If 50,000 sectors from the TB drive are randomly sampled, the chance of missing the data drops precipitously to p ≈ 0.0067 (N = 2,000,000,000, M = 200,000, and n = 50,000). The odds of missing the data are now roughly 0.67%; in other words, there is a greater than 99% chance that at least one sector of the 100 MB file will be found through the proposed random sampling.
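The numbers above can be checked with a short calculation. Sampling without replacement, the probability of missing all M target sectors in n draws from N is the product of (N - M - i) / (N - i) for i = 0 .. n-1; the sketch below evaluates that product in log space (the function name is made up for illustration).

```python
import math

def miss_probability(N: int, M: int, n: int) -> float:
    """Chance that n sectors drawn without replacement from N miss all M targets."""
    # (N - M - i) / (N - i) == 1 - M / (N - i); sum the logs to avoid underflow.
    log_p = sum(math.log1p(-M / (N - i)) for i in range(n))
    return math.exp(log_p)

N = 2_000_000_000   # sectors on a 1 TB drive
M = 200_000         # sectors occupied by the 100 MB file

print(f"n=1:      {miss_probability(N, M, 1):.4f}")
print(f"n=50,000: {miss_probability(N, M, 50_000):.4f}")
```

Since M/N is 1 in 10,000, the result for n = 50,000 is very close to e^(-5), which is where the 0.0067 figure comes from.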

Type discrimination

We can eyeball some files. But not others – worse, some files contain other files whole (doc, pdf, zip). The paper proposes “discrimination”, not “identification”, where we report that something rises to a level of likely being of type X (and you can find it’s of more than one such type). This is especially relevant on the fragment level.


  • Header recognition (works when things are sector aligned)
  • Frame recognition (many types of media are framed; might overlap with next)
  • Field validation (validate the header or frame for internal consistency)
  • n-gram analysis (per file-type, like, say, english text)
  • other statistical tests (entropy)
  • context (look at adjacent fragments, since most files are stored contiguously) Still relevant in SSDs, due to how controllers hide allocation decisions from OS!

Three discriminators

(Note that the hyperparameters were optimized via a grid search).


JPEG discriminator

Header, obv.

Byte stuffing + Huffman coding. Look for blocks that have high entropy but more FF 00 than you would expect by chance.

Compare to ground truth for other file formats. Tune on these known values.
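A rough sketch of the byte-stuffing test: flag a block as likely JPEG entropy-coded data when it has high entropy and contains more FF 00 pairs than chance would produce. The thresholds below are placeholders chosen for illustration, not the paper's tuned values, and the function names are hypothetical.

```python
import math
import random
from collections import Counter

def shannon_entropy(block: bytes) -> float:
    """Entropy of the byte histogram, in bits per byte."""
    n = len(block)
    return sum((c / n) * math.log2(n / c) for c in Counter(block).values()) if n else 0.0

def looks_like_jpeg_data(block: bytes, min_entropy: float = 7.0, min_ff00: int = 2) -> bool:
    """High entropy AND at least min_ff00 occurrences of the stuffed pair FF 00.

    Both thresholds are assumed placeholders; they would be tuned on ground truth.
    """
    return shannon_entropy(block) >= min_entropy and block.count(b"\xff\x00") >= min_ff00

# Synthetic stand-ins: random bytes, and the same bytes with FF "stuffed" to FF 00.
rng = random.Random(0)
raw = bytes(rng.getrandbits(8) for _ in range(4096))
stuffed = raw.replace(b"\xff", b"\xff\x00")[:4096]
print(looks_like_jpeg_data(raw), looks_like_jpeg_data(stuffed))
```

In 4 KiB of uniformly random data the expected number of FF 00 pairs is about 4095/65536, well under one, while stuffed data has one pair for every FF byte, so even a small count threshold separates the two.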


MP3 discriminator

Framing data is highly recognizable even though it can occur on any byte boundary:

  • Each frame header starts with a string of 11 or 12 sync bits (SB) that are set to 1.
  • The length of the frame is exactly calculable from the header.

So find the sync bits, calculate the length, skip forward and see if there’s another valid frame header.
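A sketch of that loop, restricted (as an assumption, to keep it short) to MPEG-1 Layer III headers, where the frame length is 144 * bitrate / sample_rate plus one padding byte. The function names are made up; a fuller version would also handle MPEG-2/2.5 and Layers I and II.

```python
# Bitrate (kbps) and sample-rate tables for MPEG-1 Layer III; None marks invalid indices.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES = [44100, 48000, 32000, None]

def frame_length(header: bytes):
    """Return the frame length in bytes for an MPEG-1 Layer III header, else None."""
    if len(header) < 4:
        return None
    b0, b1, b2 = header[0], header[1], header[2]
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:            # 11 sync bits, all set to 1
        return None
    if (b1 & 0x18) != 0x18 or (b1 & 0x06) != 0x02:   # version = MPEG-1, layer = III
        return None
    bitrate = BITRATES_KBPS[b2 >> 4]
    sample_rate = SAMPLE_RATES[(b2 >> 2) & 0x03]
    if bitrate is None or sample_rate is None:
        return None
    padding = (b2 >> 1) & 0x01
    return 144 * bitrate * 1000 // sample_rate + padding

def count_chained_frames(buf: bytes, start: int = 0) -> int:
    """Count consecutive valid headers, skipping forward by each frame's length."""
    pos, count = start, 0
    while pos + 4 <= len(buf):
        length = frame_length(buf[pos:pos + 4])
        if length is None:
            break
        count += 1
        pos += length
    return count

def find_frames(buf: bytes, min_chain: int = 2):
    """Return the first offset that starts a chain of at least min_chain frames."""
    for pos in range(len(buf) - 3):
        if buf[pos] == 0xFF and count_chained_frames(buf, pos) >= min_chain:
            return pos
    return None

# Synthetic fragment: 100 junk bytes, then three 128 kbps / 44.1 kHz frames.
frame = b"\xff\xfb\x90\x00" + bytes(413)
buf = b"\x01" * 100 + frame * 3
print(find_frames(buf), count_chained_frames(buf, 100))
```

Requiring a chain of at least two headers is what keeps a stray FF Ex byte pair in unrelated data from being reported as MP3.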

Huffman-coded discriminator

The DEFLATE compression algorithm is the heart of the ZIP, GZIP and PNG compression formats. Compression symbols are coded with Huffman coding. Thus, being able to detect fragments of Huffman-coded data allows distinguishing this data from other high entropy objects such as random or encrypted data. This can be very important for operational computer forensics.

We have developed an approach for distinguishing between Huffman-coded data and random or encrypted data using an autocorrelation test. Our theory is based on the idea that Huffman-coded data contains repeated sequences of variable-length bit strings. Some of these strings have 3 bits, some 4, and so on. Presumably some strings of length 4 will be more common than other strings of length 4. When a block of encoded data is shifted and subtracted from itself, sometimes the symbols of length 4 will line up. When they line up and the autocorrelation is performed, the resulting buffer will be more likely to have bits that are 0s than bits that are 1s. Although the effect will be slight, we suspected that it could be exploited.

  1. As with the JPEG and MPEG discriminators, we evaluate the input buffer for high entropy. If it is not high entropy, it is not compressed.
  2. We perform an autocorrelation by rotating the input buffer and performing a byte-by-byte subtraction on the original buffer and the rotated buffer, producing a resultant autocorrelation buffer.
  3. We compute the vector cosine between the vector specified by the histogram of the original buffer and the histogram of each autocorrelation buffer. Vector cosines range between 0 and 1 and are a measure of similarity, with a value of 1.0 indicating perfect similarity. Our theory is that random data will be similar following the autocorrelation, since the autocorrelation of random data should be random, while Huffman-coded data will be less similar following autocorrelation.
  4. We set a threshold value MCV (minimum cosine value); high-entropy data that produces a cosine similarity value between the original data and the autocorrelated data that is less than MCV is deemed to be non-random and therefore Huffman coded.

This discriminator rarely mistakes encrypted data for compressed data, and correctly identifies approximately 49.5% and 66.6% of the compressed data with 4KiB and 16KiB block sizes, respectively.
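Steps 2 through 4 can be sketched as follows. Several details are assumptions made for illustration: the shift amounts, taking the minimum cosine over those shifts, and subtraction modulo 256. The MCV threshold is left as a parameter because the tuned value is not given here, and the entropy check of step 1 is assumed to have been done by the caller.

```python
import math
import os

def histogram(buf: bytes):
    """256-bin count of byte values."""
    counts = [0] * 256
    for b in buf:
        counts[b] += 1
    return counts

def cosine(u, v) -> float:
    """Vector cosine; for non-negative histograms it lies between 0 and 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def autocorrelate(buf: bytes, shift: int) -> bytes:
    """Rotate buf by `shift` bytes and subtract it from the original, modulo 256."""
    rotated = buf[shift:] + buf[:shift]
    return bytes((a - b) & 0xFF for a, b in zip(buf, rotated))

def min_autocorrelation_cosine(buf: bytes, shifts=(1, 2, 3)) -> float:
    """Smallest cosine between the original histogram and each shifted-difference histogram."""
    original = histogram(buf)
    return min(cosine(original, histogram(autocorrelate(buf, s))) for s in shifts)

def looks_huffman_coded(buf: bytes, mcv: float) -> bool:
    """Below the minimum cosine value (MCV): deemed non-random, hence Huffman coded."""
    return min_autocorrelation_cosine(buf) < mcv

sample = os.urandom(4096)
print(f"random data cosine: {min_autocorrelation_cosine(sample):.4f}")
```

The printed value is only a starting point for picking an MCV; any threshold would need to be tuned against known compressed and encrypted samples.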

Application to statistical sampling

Although fragment type identification was created to assist in file carving and memory analysis, another use of this technology is to determine the content of a hard drive using statistical sampling.

For example, if 100,000 sectors of a 1 TB hard drive are randomly sampled and found to contain 10,000 sectors that are fragments of JPEG files, 20,000 sectors that are fragments of MPEG files, and 70,000 sectors that are blank, then it can be shown that the hard drive contains approximately 100 GB of JPEG files, 200 GB of MPEG files, and the remaining 700 GB is unwritten.
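The extrapolation is simple proportional scaling; a small sketch with the numbers from the example, taking 1 TB as 10^12 bytes (the function name is made up for illustration):

```python
def estimate_contents(drive_bytes: int, sample_counts: dict) -> dict:
    """Scale each type's share of the sample up to the size of the whole drive."""
    total = sum(sample_counts.values())
    return {kind: drive_bytes * count / total for kind, count in sample_counts.items()}

TB = 10**12
GB = 10**9
sample = {"jpeg": 10_000, "mpeg": 20_000, "blank": 70_000}

for kind, size in estimate_contents(TB, sample).items():
    print(f"{kind:6s} ~{size / GB:.0f} GB")
```

This gives a point estimate only; how tight it is depends on the sample size and on the discriminators' error rates.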

Lessons learned

Research and development in small block forensics is complicated by the large amount of data that must be processed: a single 1 TB hard drive has 2 billion sectors; storing the SHA1 codes for each of these sectors requires 40 GB, more storage than will fit in the memory of all but today’s largest computers. And since SHA1 codes are by design high entropy and unpredictable, they are computationally expensive to store and retrieve from a database. Given this, we wish to share the following lessons:

  1. We implemented frag_find in both C++ and Java. The C++ implementation was approximately three times faster on the same hardware. We determined that this speed is due to the speed of the underlying cryptographic primitives.
  2. Because it is rarely necessary to perform database JOINs across multiple hash codes, it is straightforward to improve database performance by splitting any table that references SHA1s onto multiple servers. One approach is to use one server for SHA1 codes that begin with hex 0, and one for those beginning with hex 1, and so on, which provides for automatic load balancing since hashcodes are pseudo-random. Implementing this approach requires that each SELECT be executed on all 16 servers and then the results recombined (easily done with map/reduce). Storage researchers call this approach prefix routing (Bakker et al., 1993).
  3. Bloom filters are a powerful tool to prefilter database queries.
  4. In a research environment it is dramatically easier to store hash codes in a database coded as hexadecimal values. In a production environment it makes sense to store hash codes in binary since binary takes half the space. Base64 coding seems like a good compromise, but for some reason this hasn’t caught on.
  5. We have made significant use of the C++ STL map class. For programs like frag_find we generally find that it is more useful to have maps of vectors than to use the multimap class. We suspect that it is also more efficient, but we haven’t tested this.
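Lessons 2 and 4 are easy to see in a few lines. The sketch below routes SHA-1 codes to one of 16 shards by their first hex digit and compares the size of one hash in binary, hexadecimal, and Base64. The shard function is a hypothetical stand-in for real server routing, and the inputs are synthetic.

```python
import base64
import hashlib
from collections import Counter

def shard_for(sha1_hex: str) -> int:
    """Prefix routing: the first hex digit picks one of 16 servers."""
    return int(sha1_hex[0], 16)

# Synthetic hashes; because SHA-1 output is pseudo-random, the load spreads out.
hashes = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(16_000)]
load = Counter(shard_for(h) for h in hashes)
print(sorted(load.items()))

# Storage cost of one SHA-1 in each coding.
digest = hashlib.sha1(b"example block").digest()
print(len(digest), len(digest.hex()), len(base64.b64encode(digest)))  # 20 40 28
```

Binary takes 20 bytes per hash against 40 for hexadecimal, with Base64 in between at 28.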