18: Phone Forensic Triage
DRAFT
Forensic triage
So when an incident response team (generally law enforcement) arrives on scene these days, there are typically many, many digital storage devices present. Computer drives are obvious, but there are also routers, cameras, videogame systems, cell phones, and so on. Any of these might need to be examined, since any of them might contain evidence of a particular crime or policy violation.
Triage is a term taken from emergency response. In a large-scale emergency (aka a “mass casualty incident”), there are often many more victims/patients than emergency responders available. So, the cold calculus of utilitarianism is applied. Patients and their treatment are prioritized by the severity of their condition. “Probably gonna live,” “probably gonna die,” and “immediate care is likely to prevent an otherwise likely death” are the three categories I learned for initial response.
There’s a similar (though less grim) form of triage that is applied when investigators arrive on-scene and are confronted with tons of digital evidence. We’ve only touched on some of the details of it. For example:
- live memory analysis might happen or might not depending upon the OS and hardware of running computers
- computers might or might not be shut down immediately depending upon the likelihood of full-disk encryption being in use
- “jigglers” (USB keys that emulate a mouse that moves once in a while) might be attached to running machines to prevent screensavers from kicking in
These are “stop the bleeding” type triage; another aspect of triage is categorizing what’s left. The backlog for full forensic analysis at many state crime labs is measured in months, so it’s important to prioritize the most likely evidence first.
You probably already have an inkling of how this might be done for computers, or really any device that presents a known filesystem that can be imaged (or examined behind a write blocker). For example, in a case involving contraband imagery, you might quickly check the filesystem (with fls or the commercial equivalent) for JPG files and view them on-scene.
But how about other kinds of devices that aren’t amenable to parsing by TSK and the like?
Cell phones
Why do we care about phones? Well, they’re ubiquitous, in a way that’s obvious now but wasn’t 20 years ago (kids today!). And they record the world around us – one of the most common uses of phones (after texting) is to take pictures, not to talk. It’s kinda funny that they’re called phones, really, but I guess that’s shorter than “texting and camera device”.
Cellular phones can be broadly divided into two classes: smartphones like most of y’all have (iPhones / Android phones) and flip/feature phones like most of the world has.
Smartphones are similar to the computer on your desk, in that they have relatively uniform and well-understood filesystems.
But feature phones are a different beast. Feature and flip phones:
- run proprietary, undocumented OSes, with closed-source applications
- are frequently updated in undocumented ways, and similar model numbers might have very different backends (storage of data)
In principle, one could reverse engineer each model phone and write a toolkit for it. In practice, there are literally hundreds if not thousands of unknown and subtly incompatible variants of the many featurephone OSes, with uncommunicative manufacturers (who might not even have the source code any more!), and where even finding example models of each phone can be difficult.
Whatcha gonna do?
Options for phone triage
We did talk about this a bit early on in the semester. What can you do?
Option 1: Manually browse the phone through its user interface. That is, page through the contacts, recent texts, images, and so on, ideally while recording it all on camera, to see what evidence is present.
There are of course some drawbacks here. It might not be possible (if the phone is locked or damaged). It modifies the phone (timestamps and the like might be updated). And it could miss important information (like if incriminating evidence has been deleted, it might not be visible in the UI).
Option 2: Commercial tools. Tools like Cellebrite and .XRY are the LE version of the devices they have at mall kiosks for swapping your address book to a new phone. These tools have custom translators written for most phones that can address the phone at a high level, asking for various information. They have a big ol’ pile of connectors to plug in. They have an enormous price tag (IIRC, the “academic” price we paid for one of these things was something like $5K, and I’m pretty sure LE pays a nontrivial multiple of that).
And, they still miss information. Why? Because although they can do an “image” of a phone’s storage, they typically don’t. Instead they use a higher-level communication protocol to get what they can. (This is the same protocol that lets phone kiosks do the address book transfer). But deleted items, etc., are still missing.
They might be embedded in the image, but these images are not parseable by automated tools for the reasons we described above. OR ARE THEY?
Let’s look at approaches that might work for phones (or more generally, any custom embedded file / filesystem where we don’t have the specification).
strings
One straightforward method we’ve seen before works (for some value of “works”) when you’re after textual data that’s been encoded in a known way. In short, look for “runs” of text, and extract them. strings does this for you, but it doesn’t work on non-textual data, or on data that’s not been encoded in a known way (e.g., ASCII, UTF-8, etc.).
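To make the “runs of text” idea concrete, here is a minimal sketch of what strings does for plain ASCII. The function name ascii_strings and the sample bytes are made up for illustration; the minimum run length of 4 mirrors the GNU strings default.

```python
import re

def ascii_strings(data: bytes, min_len: int = 4):
    """Yield (offset, text) for each run of printable ASCII at least min_len long."""
    # 0x20..0x7e is the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")

# Made-up sample: two long runs and one run ("abc") that is too short to report.
blob = b"\x00\x01Hello, world\x00\xff\xfeabc\x00555-0100\x07"
for offset, text in ascii_strings(blob):
    print(offset, text)
```

Anything stored as UTF-16, packed nibbles, or some vendor-specific encoding sails right past this, which is exactly the limitation described above.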
Deterministic carvers
scalpel is an open source data carving tool. Using a built-in list of patterns that match either exact strings or regular expressions, it finds instances of known file types in binary data and carves them out. The scalpel.conf file lists the header/footer regular expressions (or strings). Now that you’re a pro at carving binary data, you can see how this approach is straightforward.
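Here is a rough sketch of header/footer carving, using the well-known JPEG start (FF D8 FF) and end (FF D9) markers. This is not scalpel’s code; the carve function and its max_size cap are hypothetical stand-ins for the kind of rule a carver’s configuration expresses.

```python
JPEG_HEADER = b"\xff\xd8\xff"
JPEG_FOOTER = b"\xff\xd9"

def carve(data: bytes, header: bytes, footer: bytes, max_size: int = 10_000_000):
    """Return (start, end) byte ranges that begin with header and end at the next footer."""
    found = []
    start = data.find(header)
    while start != -1:
        # Only look for the footer within max_size bytes of the header.
        end = data.find(footer, start + len(header), start + max_size)
        if end != -1:
            found.append((start, end + len(footer)))
        start = data.find(header, start + 1)
    return found

# Made-up "image": a tiny fake JPEG surrounded by junk.
image = b"junk" + JPEG_HEADER + b"\xe0pixels" + JPEG_FOOTER + b"more junk"
ranges = carve(image, JPEG_HEADER, JPEG_FOOTER)
print(ranges)
```

Note that this quietly assumes the file is stored contiguously; a fragmented file gets carved with someone else’s data in the middle.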
It misses some things, though. It assumes that what you want is identifiable by fixed (or deterministic) headers and footers. What if what you want is actually something like a regular expression? Or, what if it’s encoded (say, by ROT-13)? Or what if it’s compressed in a ZIP file that’s embedded in your image?
You can imagine a (large) set of rules to handle this sort of thing. For example:
- search for phone numbers
- search for English (or other language?) text by strings (also include: a list of arbitrary encodings, such as ROT-13, XOR with common patterns, etc.)
- search for known file types by carving
- if a known file is expandable (ZIP, etc), expand and recursively carve
There is a tool that does all this: it’s called “Bulk Extractor” (see user manual for details.)
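To get a feel for how such rules compose, here is a toy sketch (emphatically not how Bulk Extractor is implemented). It looks for phone-number-like strings in the raw bytes, in a few XOR-decoded views, and recursively inside embedded ZIP archives. The phone pattern, the XOR_KEYS list, and the helper names are all assumptions for illustration, and the ZIP finder assumes archives with no trailing comment.

```python
import io
import re
import zipfile

PHONE = re.compile(rb"\d{3}-\d{3}-\d{4}")     # deliberately simplistic pattern
XOR_KEYS = (0x55, 0xAA, 0xFF)                  # hypothetical "common" XOR keys
ZIP_LOCAL, ZIP_END = b"PK\x03\x04", b"PK\x05\x06"

def embedded_zips(data: bytes):
    """Yield byte slices that look like complete ZIP archives (no archive comment)."""
    start = data.find(ZIP_LOCAL)
    while start != -1:
        end = data.find(ZIP_END, start)
        if end == -1:
            return
        yield data[start:end + 22]             # the end-of-archive record is 22 bytes
        start = data.find(ZIP_LOCAL, end + 22)

def scan(data: bytes, depth: int = 0, max_depth: int = 3) -> set:
    """Collect phone-like strings from data, its XOR-decoded views, and embedded ZIPs."""
    hits = set()
    for key in (0x00,) + XOR_KEYS:             # key 0x00 is the raw, undecoded view
        view = bytes(b ^ key for b in data)
        hits |= {m.group().decode("ascii") for m in PHONE.finditer(view)}
    if depth < max_depth:
        for blob in embedded_zips(data):
            try:
                with zipfile.ZipFile(io.BytesIO(blob)) as archive:
                    for name in archive.namelist():
                        hits |= scan(archive.read(name), depth + 1, max_depth)
            except zipfile.BadZipFile:
                continue                       # a false-positive signature; move on
    return hits

# Made-up image: one XOR-obscured number plus one number inside a compressed ZIP.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", "call 413-555-0199 after 5pm")
image = (b"\x00" * 8 + bytes(b ^ 0x55 for b in b"413-555-0101")
         + b"\x07" + buffer.getvalue() + b"trailing junk")
print(sorted(scan(image)))
```

Neither number is visible to a plain pass of strings over the image, which is the point of layering the rules.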
Neither tool is perfect, especially on disk images where files might be fragmented. But if you don’t know how disk allocation works (like clusters in FAT), this approach may be the best you can do.
Nondeterministic parsing
So how do we look for particular kinds of data we might want? Regular expressions. But what if there’s uncertainty about how to encode the regular expressions?
Well, how are RE usually encoded? State machines. COMPSCI 250 y’all! (Draw example for phone numbers.)
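As a reminder of what “regular expressions are state machines” looks like in code, here is a small hand-built DFA for the seven-digit pattern ddd-dddd. The state numbering and the char_class helper are invented for this sketch.

```python
def char_class(ch: str) -> str:
    """Collapse each input character into the symbol classes the DFA cares about."""
    if ch in "0123456789":
        return "digit"
    if ch == "-":
        return "dash"
    return "other"

# States 0..8 for the pattern ddd-dddd; state 8 is the only accepting state.
TRANSITIONS = {(state, "digit"): state + 1 for state in (0, 1, 2, 4, 5, 6, 7)}
TRANSITIONS[(3, "dash")] = 4
ACCEPT = 8

def matches(text: str) -> bool:
    """Run the DFA over text; any missing transition means rejection."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state == ACCEPT

for candidate in ("555-0100", "5550100", "555-010"):
    print(candidate, matches(candidate))
```

Every transition here is all-or-nothing. The probabilistic version replaces “allowed / not allowed” with a probability on each edge.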
We can use a similar thing, a probabilistic finite state machine, to encode our uncertainties, for example when we’re not sure whether a call log entry includes an optional name before the number, followed by a timestamp (on board).
This is one of the key insights in DEC0DE.
DEC0DE
Robert Walls (et al.) proposed an automated approach to this problem, building upon prior work on small-block hash functions and various inference techniques (probabilistic finite state machines and decision trees; later revised to include some information retrieval / relevance ranking stuff to improve performance).
So, first we encode “features” – the things we are looking for – as PFSMs.
Walls et al. refer to these as fields. Here’s an example of how a phone number field might be encoded on a Nokia phone (see §4 of Walls et al.).
You can imagine several of these in parallel. (On board)
Then, we can infer the “most likely” parse using the Viterbi algorithm. In this way, we can parse through the relatively large amount of data on the phone that “might” match our PFSM expressions to find the “best” (for some definition) ones. (DEC0DE does a little more with decision trees to filter out obviously bad results, but you can read the paper if you’re curious.)
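Here is a minimal sketch of Viterbi decoding over a tiny three-state model, to show the shape of the inference. This is a plain textbook HMM-style Viterbi, not DEC0DE’s actual model: the states, every probability below, and the sample bytes are made up, and the “name is optionally followed by number” structure is only hinted at via the transition table.

```python
import math
import string

STATES = ("junk", "name", "number")
LETTERS = set(string.ascii_letters.encode())
DIGITS = set(string.digits.encode())

# All probabilities here are invented for illustration.
START = {"junk": 0.8, "name": 0.1, "number": 0.1}
TRANS = {
    "junk":   {"junk": 0.80, "name": 0.10, "number": 0.10},
    "name":   {"junk": 0.10, "name": 0.70, "number": 0.20},
    "number": {"junk": 0.15, "name": 0.05, "number": 0.80},
}

def emit(state: str, byte: int) -> float:
    """Probability that a state produces this byte (each row sums to 1 over 256 bytes)."""
    if state == "name":
        return 0.9 / 52 if byte in LETTERS else 0.1 / 204
    if state == "number":
        return 0.9 / 10 if byte in DIGITS else 0.1 / 246
    return 1 / 256

def viterbi(data: bytes) -> list:
    """Return the most likely state label for each byte, computed in log space."""
    if not data:
        return []
    scores = {s: math.log(START[s]) + math.log(emit(s, data[0])) for s in STATES}
    back = []
    for byte in data[1:]:
        pointers, new_scores = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_scores[s] = scores[prev] + math.log(TRANS[prev][s]) + math.log(emit(s, byte))
        back.append(pointers)
        scores = new_scores
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(back):            # walk the back-pointers to recover the path
        state = pointers[state]
        path.append(state)
    return path[::-1]

sample = b"\x00\x13Alice5550100\xff\xfe"
labels = viterbi(sample)
print(list(zip(sample, labels)))
```

With these made-up numbers the five letters come out labeled name and the seven digits number, with junk on either side. The real system has many more states per field, which is why state explosion becomes a concern.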
Then we can imagine linking together these fields into higher-level records, composed of multiple chained fields.
DEC0DE breaks it into two applications of Viterbi to minimize state explosion, and applies a few other tricks to keep runtime reasonable.
Turns out that’s still not good enough, since there are lots of false positives.
Two more things need to happen to dramatically improve performance.
Small block hashes
Garfinkel et al. noted that hash functions can be applied per-file, but they can also be usefully applied per-block (say, for example, 512 B blocks) within files to look for commonalities across files or to find files embedded in other files. (This is an extension / reapplication of the rsync algorithm, which is well worth a glance at if you’re curious.)
The basic idea is that you compute a hash on the first n bytes of an input (that is, at offset o = 0 from the start of the input). Then you compute a hash of the n bytes at offset o = 1, then at o = 2, and so on. Here n is the size of the window, and o is its offset, which in this simplest case slides by one each time (on board). In the worst case, you end up with (filesize - n + 1) hashes.
But in practice you can often “align” the window with something. For example, if files are stored on disk and fragmented only at sector boundaries, you can leverage this by making your slide amounts be multiples of the sector size. Even if you don’t know the sector size for certain, it’s almost certainly a power of two which cuts down a lot on what you’d need to do.
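A quick sketch of the windowing arithmetic, with windows measured in bytes. The function name window_hashes and its stride parameter are made up here, and MD5 is just a convenient choice of hash.

```python
import hashlib

def window_hashes(data: bytes, n: int, stride: int = 1):
    """Return [(offset, md5 hex digest)] for each full n-byte window, advancing by stride."""
    return [(o, hashlib.md5(data[o:o + n]).hexdigest())
            for o in range(0, len(data) - n + 1, stride)]

data = bytes(range(256)) * 8                              # 2048 bytes of stand-in data
every_offset = window_hashes(data, n=512)                 # slide by one byte
sector_aligned = window_hashes(data, n=512, stride=512)   # slide by one sector

print(len(every_offset), len(sector_aligned))
```

For this 2048-byte input, sliding by one gives 2048 - 512 + 1 = 1537 hashes, while aligning to 512-byte sectors gives just 4. That difference is what makes the approach practical on real images.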
Garfinkel leveraged this trick to thoroughly beat a forensic challenge (like adams.dd on synthetic data). He used strings to find short strings of text (which are often surprisingly unique – see “shingling” as an old IR technique). He looked them up via Google, found the unique document they were sourced from, then hashed the document in 512B chunks. He then hashed the underlying image in 512B chunks, and identified the documents using the matching hash values, even when the metadata identifying them (like directory entries and/or FATs) were unavailable. This was the “MD5 trick.”
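The matching step of the “MD5 trick” can be sketched as follows, assuming you already have a copy of the candidate document in hand. The function names and the toy document/image bytes are invented, and this only handles sector-aligned, full 512-byte blocks.

```python
import hashlib

BLOCK = 512

def block_hashes(data: bytes, block: int = BLOCK) -> dict:
    """Map md5 digest -> list of offsets, one entry per full aligned block of data."""
    table = {}
    for offset in range(0, len(data) - block + 1, block):
        digest = hashlib.md5(data[offset:offset + block]).digest()
        table.setdefault(digest, []).append(offset)
    return table

def locate(document: bytes, image: bytes, block: int = BLOCK) -> dict:
    """Return {document_offset: [image_offsets]} for document blocks found in the image."""
    image_table = block_hashes(image, block)
    matches = {}
    for digest, doc_offsets in block_hashes(document, block).items():
        if digest in image_table:
            for doc_offset in doc_offsets:
                matches[doc_offset] = image_table[digest]
    return matches

# Made-up data: a three-block document stored in two fragments inside an image.
document = b"".join(bytes([65 + i]) * BLOCK for i in range(3))
image = bytes(BLOCK) + document[:BLOCK] + bytes(2 * BLOCK) + document[BLOCK:]
print(locate(document, image))
```

No filesystem metadata is consulted anywhere, so the fragments are found even though nothing says which blocks belong to which file.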
There’s lots you can do with this stuff, but Walls et al. used it very effectively for one purpose: removing potential false positives.
The key insight here is that across the same model of phone, there’s lots of data that’s the same (for example, the OS code itself). Almost certainly anything that’s bit-for-bit identical on a phone of interest and a generic off-the-shelf phone is not interesting forensically. And even across different models of phone, any blocks that are identical are again almost certainly irrelevant.
So, choose a reasonable block size and window size, store hashes of many common makes and models – these are the “boring” blocks. When doing a forensic triage exam, start by “throwing out” any blocks that hash to one of these known “boring” blocks. A single second phone of the same make and model lets you get almost all the benefit here, throwing away data that’s in common (and thus unlikely to be forensically interesting.)
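The filtering step might look something like the sketch below. It uses fixed, aligned 512-byte blocks for simplicity, which is an assumption of this example rather than a statement about DEC0DE’s exact block and window choices; the function names and sample bytes are hypothetical.

```python
import hashlib

def block_digests(data: bytes, block: int = 512) -> set:
    """Hash every full aligned block of a reference image."""
    return {hashlib.md5(data[o:o + block]).digest()
            for o in range(0, len(data) - block + 1, block)}

def interesting_blocks(evidence: bytes, reference_images, block: int = 512) -> list:
    """Return (offset, bytes) for evidence blocks not seen in any reference image."""
    boring = set()
    for reference in reference_images:
        boring |= block_digests(reference, block)
    keep = []
    for offset in range(0, len(evidence), block):
        chunk = evidence[offset:offset + block]
        if hashlib.md5(chunk).digest() not in boring:
            keep.append((offset, chunk))
    return keep

# Made-up phones: identical "firmware", but only the evidence phone has user data.
firmware = bytes([1]) * 512 + bytes([2]) * 512
reference = firmware + bytes(512)
evidence = firmware + b"Alice 413-555-0101".ljust(512, b"\x00")
kept = interesting_blocks(evidence, [reference])
print([offset for offset, _ in kept])
```

Only the block holding user data survives, so the (expensive) inference step has far less to chew on and far fewer chances to produce false positives.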
These two ideas above (block filtering then inference) do surprisingly well. If we allow a little bit of human help (supervised learning via relevance feedback) you can do even better.
Relevance feedback
In later work (LIFTR), Walls et al. showed you can improve the results by incorporating feedback from the user.
In short, DEC0DE does the best it can, but sometimes it’s unsure (is this a phone number?), or it’s wrong (“Name: XWQEFNsdfC”). If an investigator can label fields for DEC0DE and manually set either labels or probabilities, the system can re-assess existing data with those assumptions. It turns out this dramatically improves quality.
The basic algorithm is as follows:
- Group fields by NAND page (like a sector boundary for flash)
- Score each field using a combination of features: text quality (remove known false positives like CamelCasedWords or SELECT BY or the like), a priori knowledge (investigator provides up to 5 strings of interest or the like), and filename/filesystem knowledge (pages corresponding to the contacts.db file will get a higher score)
- Score page using weighted sum of field scores
- Rank by page scores
Then have investigators label and score a few top-ranked pages, then repeat the inference step. Eventually the high value pages float to the top.
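The initial scoring and ranking can be sketched like this. It is a simplification, not LIFTR’s actual scoring function: the two features, the WEIGHTS values, and the regular expressions are made up, and the weights are applied per feature with the page score taken as a plain sum over its fields.

```python
import re
from collections import defaultdict

WEIGHTS = {"quality": 1.0, "a_priori": 3.0}          # hypothetical feature weights
CAMEL = re.compile(r"^(?:[A-Z][a-z]+){2,}$")         # e.g. CamelCasedWords
SQL = re.compile(r"\b(SELECT|INSERT|UPDATE|DELETE)\b")

def field_score(text: str, strings_of_interest) -> float:
    """Combine a crude text-quality feature with an investigator-supplied-string feature."""
    quality = 0.0 if CAMEL.match(text) or SQL.search(text) else 1.0
    a_priori = 1.0 if any(s.lower() in text.lower() for s in strings_of_interest) else 0.0
    return WEIGHTS["quality"] * quality + WEIGHTS["a_priori"] * a_priori

def rank_pages(fields, strings_of_interest) -> list:
    """fields is an iterable of (page_number, text); returns (page, score), best first."""
    totals = defaultdict(float)
    for page, text in fields:
        totals[page] += field_score(text, strings_of_interest)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Made-up fields grouped by (NAND) page number.
fields = [
    (0, "CamelCasedWords"), (0, "SELECT name FROM contacts"),
    (1, "Alice Smith"), (1, "413-555-0101"),
    (2, "call Bob tonight"),
]
ranking = rank_pages(fields, strings_of_interest=["bob"])
print(ranking)
```

The page full of code-like strings sinks to the bottom, and the page matching the investigator’s string of interest rises to the top.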
What do investigators actually have to do?
- Mark fields. For a given page, LIFTR tokenizes it, and the investigator marks any/all fields that appear to be semantically relevant. Names, phone numbers, etc. This can go better if the investigator has domain knowledge (like knowing that phone numbers might be encoded as nibbles) though LIFTR knows about this for some phones/fields/etc.
- Mark relevance. If a field is relevant to an investigation (or might be) the investigator flags it.
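As an example of the kind of domain knowledge mentioned above, here is a sketch of decoding a number stored as nibbles. It assumes the swapped-nibble packed BCD layout used in GSM SMS PDUs (two digits per byte, low nibble first, 0xF as padding). Whether a particular feature phone stores its contacts or call logs this way is an assumption you would have to check per phone.

```python
def decode_swapped_bcd(raw: bytes) -> str:
    """Decode packed BCD where each byte holds two digits, low nibble first."""
    digits = []
    for byte in raw:
        for nibble in (byte & 0x0F, byte >> 4):
            if nibble == 0x0F:                 # padding nibble ends an odd-length number
                return "".join(digits)
            if nibble > 9:
                raise ValueError(f"not a decimal digit: {nibble:#x}")
            digits.append(str(nibble))
    return "".join(digits)

print(decode_swapped_bcd(bytes.fromhex("1453551010")))
print(decode_swapped_bcd(bytes.fromhex("2143f5")))
```

The raw bytes 14 53 55 10 10 contain no ASCII digits at all, so strings would never find this number even though it is sitting right there.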
Then what does LIFTR do?
- Find semantically relevant pages. When a page shares a relevant token with an unmarked page it is more likely to be relevant.
- Blacklist tokens. See paper, but in short, some text and other strings are obviously irrelevant and can be ignored.
- Update quality scores on the basis of new tokens, labels, and relevance
- Display the top few results to the investigator, repeat as needed.
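One round of this loop might be sketched as below. This is a toy stand-in for the real thing: pages are reduced to sets of tokens, “relevance” is just a count of tokens shared with investigator-marked pages, and the names rerank, marked_relevant, and blacklist are invented for the example.

```python
def rerank(pages: dict, marked_relevant: set, blacklist: set) -> list:
    """Order unmarked pages by how many non-blacklisted tokens they share with marked pages."""
    relevant_tokens = set()
    for page_id in marked_relevant:
        relevant_tokens |= pages[page_id] - blacklist
    scores = {page_id: len(tokens & relevant_tokens)
              for page_id, tokens in pages.items()
              if page_id not in marked_relevant}
    # Highest score first; break ties by page id so the order is deterministic.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))

# Made-up pages, each reduced to the set of tokens found on it.
pages = {
    0: {"Alice", "413-555-0101", "SELECT"},
    1: {"Alice", "meet at noon"},
    2: {"SELECT", "FROM"},
    3: {"Bob"},
}
order = rerank(pages, marked_relevant={0}, blacklist={"SELECT", "FROM"})
print(order)
```

Page 1 moves up because it shares “Alice” with the page the investigator marked, while page 2 gets no credit for sharing the blacklisted “SELECT”. Repeating this after each batch of labels is what lets the high-value pages float to the top.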
See the LIFTR paper for more details. This is a really interesting approach that ties together a bunch of CS areas to improve a forensic tool: traditional algorithms (Viterbi), ML (the use of decision trees), and IR (iterative relevance feedback) via HCI (the marking/relevance interface).