Cryptographic hashes work great to identify segments of identical data.
But there’s a problem: what if we don’t know where the boundary of a file is? Like, the “JPEG-embedded-in-PDF” problem, or just some other circumstance where we can’t just do sector-based (or the equivalent) hashing? Or, what if the target of an investigation deliberately changes a file? Cryptographic hashes are designed so that even a single-bit change in the input will change ~1⁄2 of the output hash bits. Obviously the human eye might still be able to detect the similarity between documents, but can we do so automatically?
Today’s paper (Kornblum’s “…Context Triggered Piecewise Hashing”) is about one method to do so. (Notably it is independent of the underlying file type; there are methods for looking for, say, similar JPEGs, like PhotoDNA; we may talk more in depth about image-specific forensics later in the semester).
Piecewise hashing, rolling hashing
Piecewise hashing is a version of the “MD5 trick” for files rather than disk images. It’s just block hashing. Nothing deeper.
FNV: a fast but non-cryptographic hash (it makes no guarantees about collision resistance, preimage resistance, etc.). It’s the kind of thing you might use to compute hashes for a hash table when you are not concerned about hostile input.
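A minimal sketch of the 32-bit FNV-1a variant (the constants are the standard FNV offset basis and prime; the function name is my own):

```python
def fnv1a_32(data: bytes) -> int:
    """FNV-1a, 32-bit variant: XOR in each byte, then multiply by the FNV prime."""
    h = 0x811C9DC5                           # 32-bit FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF    # 32-bit FNV prime; keep 32 bits
    return h
```

Note how cheap this is per byte (one XOR, one multiply) compared to a cryptographic hash; that speed is exactly why it suits hash tables and, here, piecewise hashing.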
Rolling hash functions, due to Rabin (or Karp and Rabin, depending upon context) are a different beast. In Kornblum, they are defined as “a pseudo-random value based only on the current context of the input. The rolling hash works by maintaining a state based solely on the last few bytes from the input.” So, the value r_p for a given position p in the input is a function of the last several bytes ending at position p.
Treat a k-gram c1 … ck as a k-digit number in some base b. The hash H(c1 …ck) of c1 …ck is this number:
c1 * b^(k−1) + c2 * b^(k−2) + … + c_(k−1) * b + ck
To compute the hash of the k-gram c2 … ck+1 , we need only subtract out the high-order digit, multiply by b, and add in the new low order digit. Thus we have the identity:
H(c2 … c_(k+1)) = (H(c1 … ck) − c1 * b^(k−1)) * b + c_(k+1)
Here the ‘c’s might be byte values, and b might be 256; the width of the hash is constrained by the underlying data type (be careful if you use signed vs. unsigned).
(This formulation has a problem in that new values only tend to affect low-order bits. See Broder, “On the resemblance and containment of documents.” for at least one fix.)
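The update identity can be checked with a short sketch (b = 256 and a 32-bit state, per the discussion above; function names are my own):

```python
MASK = (1 << 32) - 1  # keep the state to 32 bits

def rk_hash(window: bytes, b: int = 256) -> int:
    """Hash a k-gram directly: c1*b^(k-1) + ... + ck, mod 2^32."""
    h = 0
    for c in window:
        h = (h * b + c) & MASK
    return h

def rk_roll(h: int, k: int, out_byte: int, in_byte: int, b: int = 256) -> int:
    """Slide the window one byte: subtract the high-order digit,
    multiply by the base, and add the new low-order digit."""
    high = pow(b, k - 1, MASK + 1)           # b^(k-1) mod 2^32
    return ((h - out_byte * high) * b + in_byte) & MASK
```

Rolling the hash across an input and recomputing each window from scratch should agree at every position, which is the whole point: each step is O(1) instead of O(k).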
Rolling hashes are (part of) how MOSS works, for those of you interested in plagiarism detection :)
Combining the hash algorithms: context triggered piecewise hashing
The key idea in Kornblum then is that we combine the two hashes to find a “better” fingerprint for a file, one that will hopefully still be useful if parts of the file change or have their relative offsets changed.
First, we compute the rolling hash of the file. Keep in mind this is actually a long sequence of hash values (almost as many as there are bytes in the file). But we don’t keep them all. Instead, we look for a particular marker value in the rolling hash. (Depending upon the size of the rolling hash, we might either look for a precise value, or a value mod a smaller size, etc. – the goal here is to get “enough” markers).
These markers serve as the boundaries for our traditional piecewise hash. Kornblum refers to these values as “triggers.”
So the CTPH hash is the sequence of piecewise hashes of the pieces delimited by the trigger values of the rolling hash.
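A toy version of the whole scheme (my own sketch, not spamsum’s actual algorithm: the rolling hash here is just the sum of the last k bytes, and FNV-1a is inlined as the piecewise hash):

```python
def fnv1a(data: bytes) -> int:
    h = 0x811C9DC5
    for c in data:
        h = ((h ^ c) * 0x01000193) & 0xFFFFFFFF
    return h

def ctph_sketch(data: bytes, k: int = 7, modulus: int = 64) -> list:
    """Toy CTPH: piecewise-hash the pieces delimited by rolling-hash triggers."""
    window, rolling = [], 0
    pieces, start = [], 0
    for i, c in enumerate(data):
        window.append(c)
        rolling += c
        if len(window) > k:
            rolling -= window.pop(0)         # keep only the last k bytes
        if rolling % modulus == modulus - 1: # trigger: end the current piece
            pieces.append(fnv1a(data[start:i + 1]))
            start = i + 1
    if start < len(data):                    # hash the final trailing piece
        pieces.append(fnv1a(data[start:]))
    return pieces
```

The payoff: because piece boundaries depend on local content rather than absolute offsets, an edit near the end of the file leaves the earlier pieces (and their hashes) untouched.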
Why “at most 2” changes if there is a change in the underlying file? Probably due to min-block-size, as described later.
spamsum is a specific implementation of the above. It uses a particular implementation of the rolling hash, FNV for the regular hash, a deterministic value for the trigger value (based upon constraints), and computes a signature. The exact details are in the paper if you are curious. The important thing to note is there is a minimum block size specified (power of 2) which relates to how large of a trigger value is looked for.
Ultimately, the signature consists of a sequence of base64 encoded LS6Bs (least significant 6 bits) of FNV hashes of the chunks delimited by triggers. Note that this is actually done twice: once for a blocksize of b, and once for a block size of 2b!
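The LS6B encoding step is simple enough to show directly (a sketch; the helper name is mine, and the alphabet is the standard base64 one):

```python
import string

# standard base64 alphabet: A-Z, a-z, 0-9, +, /
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def ls6b_char(h: int) -> str:
    """Encode the least-significant 6 bits of a chunk hash as one base64 character."""
    return B64[h & 0x3F]
```

So each trigger-delimited chunk contributes exactly one printable character to the signature string, which keeps signatures short and easy to store or grep.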
If the signature is too short, the block size is halved and the procedure runs again until a “long enough” signature is found.
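The halving loop might look like this (a sketch under my own assumptions: `signature_at(data, blocksize)` is a stand-in for whatever routine produces a signature at a given block size, and the constants are illustrative, not spamsum’s):

```python
def signature_with_halving(data: bytes, signature_at, min_len: int = 8,
                           b_init: int = 4096, b_min: int = 3):
    """Halve the block size until the signature is 'long enough'."""
    b = b_init
    while b > b_min:
        sig = signature_at(data, b)
        if len(sig) >= min_len:      # enough triggers fired at this block size
            return b, sig
        b //= 2                      # too few pieces: look for smaller triggers
    return b_min, signature_at(data, b_min)
```

Intuition for the loop: a larger block size means a rarer trigger condition, so small inputs may produce too few pieces; halving makes triggers more frequent until the signature has enough characters to be comparable.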
Again: whatever final block size b the halving settles on, the signature includes two sequences, one computed at b and one at 2b.
Comparing spamsum signatures
Notably you need the same block size – this is why spamsum includes both b and 2b, so you can compare signatures within one step of each other. (Why might you need this? Because the block size is sensitive to the input, and if the input is tweaked, then the required blocksize might change.)
“Recurring” sequences are removed from the signature. (This does bring up the question of why they are not removed in the first place, and what kind of recurrences in particular: 1-byte, 2-byte, 3-byte? Or whatever corresponds to the LS6Bs?) Reason: they correspond to repeating parts of the input.
Then the weighted edit distance between two hashes is computed. Given two strings s1 and s2, the edit distance between them is defined as ‘‘the minimum number of point mutations required to change s1 into s2’’, where a point mutation means either changing, inserting, or deleting a letter (Allison, 1999).
The spamsum algorithm uses a weighted version of the edit distance formula originally developed for the USENET newsreader trn (Andrew, 2002). In this version, each insertion or deletion is weighted as a difference of one, but each change is weighted at three and each swap (i.e. the right characters but in reverse order) is weighted at five.
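A sketch of a weighted edit distance with those weights (my own DP implementation, not trn’s actual code). One quirk worth noticing: with change = 3 and insert/delete = 1, a delete-plus-insert pair (cost 2) undercuts a change (cost 3), and the DP accounts for that automatically.

```python
def weighted_edit_distance(s1: str, s2: str,
                           w_indel: int = 1, w_change: int = 3,
                           w_swap: int = 5) -> int:
    """Edit distance where indels cost 1, changes cost 3, adjacent swaps cost 5."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i * w_indel                # delete everything
    for j in range(m + 1):
        d[0][j] = j * w_indel                # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else w_change
            d[i][j] = min(d[i - 1][j] + w_indel,      # deletion
                          d[i][j - 1] + w_indel,      # insertion
                          d[i - 1][j - 1] + cost)     # match / change
            # swap: right characters, reverse order
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1] and s1[i - 1] != s1[i - 2]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + w_swap)
    return d[n][m]
```

For example, `weighted_edit_distance("abc", "abd")` comes out to 2 (delete ‘c’, insert ‘d’) rather than 3, because of the quirk above.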
Again, exact formula in the paper, but ultimately a “match score” from 0–100 is computed.
The match score represents a conservative weighted percentage of how much of s1 and s2 are ordered homologous sequences. That is, a measure of how many of the bits of these two signatures are identical and in the same order. The higher the match score, the more likely the signatures came from a common ancestor and the more likely the source files for those signatures came from a common ancestor.
CTPH can be used to identify documents that are highly similar but not identical. Kornblum shows anecdotally that altered MS Word docs still have high similarity (but does not check against other unrelated docs – maybe they all are similar to some degree due to headers etc?).
Another application of CTPH technology is partial file matching. That is, for a file of n bits, a second file is created containing only the first n/3 bits of the original file. A CTPH signature of the original file can be used to match the second file back to the first.
“The comparison of partial files, especially footers, is significant as it represents a new capability for forensic examiners. A JPEG file, for example, missing its header cannot be displayed under any circumstances. Even if it contained a picture of interest, such as child pornography, the examiner would not be able to determine its content. By using CTPH to detect the homology between unviewable partial files and known files, the examiner can develop new leads even from files that cannot be examined conventionally.”
Stream exams? Maybe, but only if you fixed the block size in advance.