08: Similarity Digests

Overview

We use hashes to find content of interest and discard content not of interest. As pointed out last class, full-file and even block-level hashes have problems when items of interest are not stored as standalone files but are instead embedded in other files (and thus not block-aligned). A similar problem occurs when a file is slightly modified: if just a few bytes change, it would be great if we could still recognize the file, but regular hash functions fail in this scenario.

Today’s paper by Roussev takes an approach similar to Kornblum’s paper from last class: the overall idea is to look for something special in the input that marks the significant or interesting parts, hash those parts in particular, and combine those hashes in a special way to create a “similarity digest.” Again, the goal is a function that computes a value for a given input (like a regular hash), but two inputs need not produce identical values for us to relate them. Instead, we have a defined way to compare similarity digests to estimate how similar two files are.

Rabin fingerprints, which we talked about last class, are a way of generating a “rolling hash” of a stream of symbols. They were used in the 90s to find syntactic similarities between files in several publications: Roussev cites sif, Brin’s copy-detection scheme, and Broder’s work on web page similarity. He leaves out my personal favorite, Schleimer et al.’s winnowing algorithm for document fingerprinting (which underpins MOSS!). We might actually cover that paper later, so I’ll leave the details out for now :)

“The basic idea, which is referred to as anchoring, chunking or shingling, is to use a sliding Rabin fingerprint over a fixed-size window that splits data into pieces. A hash value h is computed for every window of size w. The value is divided by a constant c and the remainder is compared with another constant m. If the two values are equal (i.e., m ≡ h mod c), then the data in the window is declared as the beginning of a chunk (anchor) and the sliding window is moved one position. This process is continued until the end of the data is reached. For convenience, the value of c is typically a power of two (c = 2^k) and m is a fixed number between zero and c − 1. Once the baseline anchoring is determined, it can be used in a number of ways to select characteristic features. For example, the chunks in between anchors can be chosen as features.”

The above is basically a description of how Kornblum’s algorithm finds trigger points.
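
To make the trigger-point idea concrete, here is a minimal sketch of the anchoring scheme described in the quote, using a simple polynomial rolling hash as a stand-in for a true Rabin fingerprint; the window size w, divisor c, and residue m below are illustrative values, not parameters taken from either paper.

```python
def find_anchors(data: bytes, w: int = 7, c: int = 2**6, m: int = 0):
    """Slide a window of size w over data, keep a rolling hash h, and declare
    an anchor wherever h mod c == m (c a power of two, 0 <= m < c)."""
    if len(data) < w:
        return []
    base = 257
    top = pow(base, w - 1)          # weight of the byte about to leave the window
    h = 0
    for b in data[:w]:              # hash of the first window
        h = h * base + b
    anchors = [0] if h % c == m else []
    for i in range(w, len(data)):
        # Roll the window forward: drop the outgoing byte, bring in the new one.
        h = (h - data[i - w] * top) * base + data[i]
        if h % c == m:
            anchors.append(i - w + 1)   # offset where the current window starts
    return anchors

if __name__ == "__main__":
    print(find_anchors(b"the quick brown fox jumps over the lazy dog" * 3))
```

Because the decision at each offset depends only on the w bytes in the window, the same chunk boundaries reappear even if surrounding content is inserted or deleted.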

“Consider two versions of the same document. One document can be viewed as being derived from the other by inserting and deleting characters. For example, an HTML page can be converted to plain text by removing all the HTML tags. Clearly, this would modify a number of features, but the chunks of unformatted text would remain intact and produce some of the original features, permitting the two versions of the document to be automatically correlated. For the actual feature comparison, the hash values of the selected features are stored and used as a space-efficient representation of a “fingerprint.””

The idea here is a (document-type) specific kind of “normalization”. This is not automatically generalizable – you have to know something about the document’s syntax / semantics to do the equivalent for new document types – but it could help reduce error rates.

Other notes about Rabin fingerprinting: “Rabin’s randomized model of fingerprinting works well on average, but suffers from problems related to coverage and false positive rates. Both these problems can be traced to the fact that the underlying data can have significant variations in information content. As a result, feature size/distribution can vary widely, which makes the fingerprint coverage highly skewed. Similarly, low-entropy features produce abnormally high false positive rates that render the fingerprint an unreliable basis for comparison.”

Non-Rabin Fingerprinting

The goal here is to select multiple characteristic “features” from a file, summarize them in some way, and have that summary be comparable to a summary from another file. The collection of features / summary is a fingerprint or signature. For this paper, the feature is just a sequence of bits. (“The expectation is that the approach would be complemented in the later stages by higher-order analysis of the filtered subset.”)

Three contributions:

  • New feature selection algorithm that selects statistically improbable features rather than Rabin-esque random selection.
  • Filtering of features based upon entropy measures (empirically: reduces FPR).
  • Similarity measure based upon Bloom filters, scalable to objects of arbitrary size.

Selecting improbable features

We can’t rely on textual features for arbitrary binary data, so the paper chooses a feature size of B = 64 bytes, which empirically seems to work well. (The implementation and concept are the same for other sizes.) Tradeoff: the smaller the features, the higher the granularity, but the larger the digests and the more processing involved.

Algorithm is as follows:

Initialization: The entropy score Hnorm, precedence rank Rprec and popularity score Rpop are initialized to zero. A threshold t is declared.

Hnorm Calculation: The Shannon entropy is first computed for every feature (B-byte sequence): H = −Σ_{i=0..255} P(X_i) log P(X_i), where P(X_i) is the empirical probability of encountering byte value i. Then, the entropy score is computed as Hnorm = ⌊1000 × H / log2 B⌋.

Note 1: This implies either per-file computation of P(X_i) or a master table. My very quick examination of the sdhash source code indicates that it’s initialized to equiprobable, but I may have missed per-file tuning.
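
For concreteness, here is a sketch of the Hnorm computation for a single B-byte feature, estimating P(X_i) from the feature itself (one possible reading; as the note above says, an implementation might instead use a master table).

```python
import math
from collections import Counter

def entropy_score(feature: bytes, B: int = 64) -> int:
    """Hnorm for one B-byte feature, with P(X_i) estimated from the feature
    itself."""
    counts = Counter(feature)
    H = 0.0
    for n in counts.values():
        p = n / len(feature)          # empirical P(X_i)
        H -= p * math.log2(p)         # Shannon entropy in bits
    return math.floor(1000 * H / math.log2(B))   # scale to the range 0..1000
```

With B = 64, log2 B = 6 bits is the maximum possible entropy of a 64-byte sample (all bytes distinct), so Hnorm lands in the range 0..1000.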

Rprec Calculation: The precedence rank Rprec value is obtained by mapping the entropy score Hnorm based on empirical observations.

The reason for and details of the mapping are unclear here, but they explain later – it’s how they do the entropy-based filtering! In essence, they use the identity function most of the time, but drop features with very low or very high entropy scores.

Rpop Calculation: For every sliding window of W consecutive features, the leftmost feature with the lowest precedence rank Rprec is identified. The popularity score Rpop of the identified feature is incremented by one.

Feature Selection: Features with popularity rank Rpop >= t, where t is a threshold parameter, are selected.

(show figure 1 from paper, explain)

The principal observation is that the vast majority of the popularity scores are zero or one; this is a very typical result.
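
Here is a hedged sketch of the whole selection loop, reusing the entropy_score sketch above and treating the precedence-rank mapping as the identity (the simplification noted earlier); W and t are illustrative parameters, not necessarily the paper’s defaults.

```python
def select_features(data: bytes, B: int = 64, W: int = 64, t: int = 16):
    """Every offset defines a B-byte feature. Within each window of W
    consecutive features, bump the popularity of the leftmost feature with
    the lowest precedence rank; keep features whose popularity reaches t."""
    n_features = len(data) - B + 1
    if n_features <= 0:
        return []
    # Precedence rank: here simply the entropy score (identity mapping).
    rprec = [entropy_score(data[i:i + B], B) for i in range(n_features)]
    rpop = [0] * n_features
    for start in range(n_features - W + 1):
        window = rprec[start:start + W]
        leftmost_min = start + window.index(min(window))  # leftmost lowest rank
        rpop[leftmost_min] += 1
    return [i for i in range(n_features) if rpop[i] >= t]
```

If the sketch is faithful, most Rpop values should come out as zero or one, matching the observation above, with only a small set of features clearing the threshold.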

Filtering weak features

Many files contain areas of low entropy; features drawn from such areas are of low value (the probability that the feature will not be unique to a specific data object is almost 100%). Empirically, this happens in most files that are not already entropy-coded.

“During the filtering process, all features with entropy scores of 100 or below, and those exceeding 990 were unconditionally dropped from consideration. The latter decision is based on the observation that features with near-maximum entropy tend to be tables whose content is common across many files. For example, Huffman and quantization tables in JPEG headers can have very high entropy but are poor characteristic features.”
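
That filtering rule is simple enough to state directly in code; the thresholds 100 and 990 come straight from the quote above.

```python
def keep_feature(hnorm: int) -> bool:
    # Drop features with entropy score <= 100 (too little information) or
    # > 990 (near-maximum entropy, e.g., common Huffman/quantization tables).
    return 100 < hnorm <= 990
```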

Generating digests

Once features have been selected and filtered, it’s time to build fingerprints. Roussev uses Bloom filters.

In particular, selected features are hashed using SHA-1 (160 bits) and the result is split into five sub-hashes, which are used as independent hash functions to insert the feature into the filter.

The implementation uses 256-byte filters with a maximum of 128 elements per filter. After a filter reaches capacity, a new filter is created and the process is repeated until the entire object is represented.

One subtle detail is that before a feature is inserted, the filter is queried for the feature; if the feature is already present, the count for the number of elements inserted is not increased. This mechanism prevents the same feature from being counted multiple times, which reduces the false positive rate by forcing the inclusion of non-duplicate features; the accuracy of the similarity estimate is also increased.
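
Here is a sketch of the digest-generation step under the parameters just described (SHA-1 split into five sub-hashes, 256-byte filters, at most 128 elements each); the way the sub-hashes are mapped to bit positions below is a simplification, not sdhash’s exact bit extraction.

```python
import hashlib

FILTER_BITS = 256 * 8      # 256-byte (2048-bit) Bloom filters
MAX_ELEMENTS = 128         # per-filter capacity

def feature_bits(feature: bytes):
    """Split the 160-bit SHA-1 of a feature into five 32-bit sub-hashes and
    map each to a bit position in the filter."""
    digest = hashlib.sha1(feature).digest()
    return [int.from_bytes(digest[4 * k:4 * k + 4], "big") % FILTER_BITS
            for k in range(5)]

def build_digest(features):
    """Insert selected features into a sequence of Bloom filters, opening a
    new filter whenever the current one holds MAX_ELEMENTS features."""
    filters, current, count = [], bytearray(256), 0
    for f in features:
        bits = feature_bits(f)
        # Query before insert: if every bit is already set, treat the feature
        # as a duplicate and do not increase the element count.
        if all(current[b // 8] & (1 << (b % 8)) for b in bits):
            continue
        for b in bits:
            current[b // 8] |= 1 << (b % 8)
        count += 1
        if count == MAX_ELEMENTS:
            filters.append(bytes(current))
            current, count = bytearray(256), 0
    if count:
        filters.append(bytes(current))
    return filters
```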

So the Bloom filters themselves are the similarity digest!

Comparing digests

How do you compare Bloom filters? You can compute the expected number of bits in common by chance, and estimate the maximum and minimum number of overlapping bits (see §3.4 in the paper).

But the important things here are that you can define a configurable cutoff C below which matches are declared to be due to chance alone, and empirically find a value Nmin, the minimum number of elements required to compute a meaningful score (empirically, 6 works). The former is important for avoiding false positives; the latter is important for signaling to the user that there just aren’t enough features to meaningfully compare the filter in question.

Then you can declare a scoring function.

“Informally, the first filter from the shorter digest (SD1) is compared with every filter in the second digest (SD2) and the maximum similarity score is selected. This procedure is repeated for the remaining filters in SD1 and the results are averaged to produce a single composite score.”

“The rationale behind this calculation is that a constituent Bloom filter represents all the features in a continuous chunk of the original data. Thus, by comparing two filters, chunks of the source data are compared implicitly.”

“The size of the filters becomes a critical design decision – larger filters speed up comparisons while smaller filters provide more specificity. The parameters, including α = 0.3, have been calibrated experimentally so that the comparison of the fingerprints of unrelated random data consistently yields a score of zero.”
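
Here is a simplified scoring sketch in the spirit of the quoted procedure, operating on the list of filters produced by the build_digest sketch above: score each filter of the shorter digest against every filter of the longer one, take the maximum, and average. The chance-overlap estimate and cutoff handling are stand-ins for the paper’s §3.4 formulas, not the exact sdhash math.

```python
def bit_count(f: bytes) -> int:
    return sum(bin(b).count("1") for b in f)

def common_bits(f1: bytes, f2: bytes) -> int:
    return sum(bin(a & b).count("1") for a, b in zip(f1, f2))

def filter_score(f1: bytes, f2: bytes, cutoff: float = 0.3) -> int:
    """Score two filters 0..100 by how much their bit overlap exceeds what
    unrelated filters would share by chance; the cutoff plays the role of
    the paper's alpha = 0.3 calibration."""
    m = len(f1) * 8
    n1, n2 = bit_count(f1), bit_count(f2)
    expected = n1 * n2 / m                 # rough estimate of chance overlap
    max_overlap = min(n1, n2)
    if max_overlap <= expected:
        return 0
    score = (common_bits(f1, f2) - expected) / (max_overlap - expected)
    return max(0, round(100 * (score - cutoff) / (1 - cutoff)))

def digest_score(sd1: list, sd2: list) -> int:
    """Compare every filter of the shorter digest against all filters of the
    longer one, take each maximum, and average. A real implementation would
    also refuse to score filters holding fewer than Nmin (6) elements; this
    sketch does not track per-filter element counts."""
    if not sd1 or not sd2:
        return -1                          # nothing meaningful to compare
    if len(sd1) > len(sd2):
        sd1, sd2 = sd2, sd1
    best = [max(filter_score(f1, f2) for f2 in sd2) for f1 in sd1]
    return round(sum(best) / len(best))
```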

Experimental evaluation

Evaluated on sample document sets from the NPS corpus, though the paper does not include all results due to space constraints.

Parameters as above. Varied fragment size (corresponding to block size, perhaps.)

Metrics: (Always an important question!)

  • Detection rate (correctly attributing a sample to its source), varies with the threshold t as expected
  • Non-classification rate, depends only upon fragment content (a fragment is non-classified if no features can be extracted from it)
  • Misclassification rate, depends upon the threshold chosen for the similarity score (0..100 – when do we say two fragments are similar enough?)

Result summary:

Detection rates are near-perfect for thresholds up to 22, dropping linearly past this value.

Non-classification rates higher for smaller fragments (as you’d expect).

Misclassification rates: vary based upon cutoff. Around 43 is maximal for 512-byte fragments; around 21 for 1024+ byte fragments.

Storage/throughput: digests are about 2.6% of the original size. Throughput is about 30 MB/sec/core on a modern processor (thus it could be done in-line with imaging, especially on a multicore machine). There are also faster implementations now that can batch data to the GPU.