02: Survey

Welcome

Hello and welcome!

Still the most important thing to know today: the course web site is at http://people.cs.umass.edu/~liberato/courses/2019-spring-compsci590F/. It includes the syllabus for this class and you are expected to read it in its entirety.

If you are watching this on-line, great! Be sure to say hello on Piazza!

Next 10 Years

Let’s start by going over the Garfinkel paper. Please interrupt with questions if you have them!

  • common file formats, schemas and ontologies
  • system requirements, and argued that inefficient system design, wasted CPU cycles, and the failure to deploy distributed computing techniques are introducing significant and unnecessary delays
  • very few DF systems designers build upon previous work; instead, each new project starts afresh
  • proactive (predicts attacks and changes its collection behavior before an attack takes place) vs reactive systems (“audit trails and internal logs”)
  • visualization
  • virtualization
  • digital forensics largely lacks standardization and process, and what little widespread knowledge we have is “heavily biased towards Windows, and to a lesser extent, standard Linux distributions”
  • Unaddressed, Beebe says, are the problems of scalability, the lack of intelligent analytics beyond full-text search, non-standard computing devices (especially small devices), ease of use, and a laundry list of unmet technical challenges

history

early days

DF is roughly forty years old. What we now consider forensic techniques were developed primarily for data recovery.

early days:

  • Hardware, software, and application diversity.
  • A proliferation of data file formats, many of which were poorly documented.
  • Heavy reliance on time-sharing and centralized computing facilities; rarely was there significant storage in the home of either users or perpetrators that required analysis.
  • The absence of formal process, tools, and training.

There was also a limited need to perform DF. Evidence left on time sharing systems frequently could be recovered without the use of recovery tools. And because disks were small, many perpetrators made extensive printouts. As a result, few cases required analysis of digital media.

golden age 1999-2007

During this time digital forensics became a kind of magic window that could see into the past (through the recovery of residual data that was thought to have been deleted) and into the criminal mind (through the recovery of email and instant messages). Network and memory forensics made it possible to freeze time and observe crimes as they were being committed even many months after the fact.

This Golden Age was characterized by:

  • The widespread use of Microsoft Windows, and specifically Windows XP.
  • Relatively few file formats of forensic interest: mostly Microsoft Office for documents, JPEG for digital photographs, and AVI and WMV for video.
  • Examinations largely confined to a single computer system belonging to the subject of the investigation.
  • Storage devices equipped with standard interfaces (IDE/ ATA), attached using removable cables and connectors, and secured with removable screws.
  • Multiple vendors selling tools that were reasonably good at recovering allocated and deleted files.

coming crisis

Hard-won capabilities are in jeopardy of being diminished or even lost as the result of advances and fundamental changes in the computer industry:

  • The growing size of storage devices means that there is frequently insufficient time to create a forensic image of a subject device, or to process all of the data once it is found.
  • The increasing prevalence of embedded flash storage and the proliferation of hardware interfaces means that storage devices can no longer be readily removed or imaged.
  • The proliferation of operating systems and file formats is dramatically increasing the requirements and complexity of data exploitation tools and the cost of tool development.
  • Whereas cases were previously limited to the analysis of a single device, increasingly cases require the analysis of multiple devices followed by the correlation of the found evidence.
  • Pervasive encryption (Casey and Stellatos, 2008) means that even when data can be recovered, it frequently cannot be processed.
  • Use of the “cloud” for remote processing and storage, and to split a single data structure into elements, means that frequently data or code cannot even be found.
  • Malware that is not written to persistent storage necessitates expensive RAM forensics.
  • Legal challenges increasingly limit the scope of forensic investigations.

It is vital for forensics examiners to be able to extract data from cell phones in a principled manner, as mobile phones are a primary tool of criminals and terrorists. But there is no standard way to extract information from cell phones.

Similar problems with diversity and data extraction exist with telecommunications equipment, video game consoles, and even eBook readers. The last two pose the additional problem that the techniques used to protect their intellectual property also make these systems resistant to forensic analysis.

Our inability to extract information from devices in a clean and repeatable manner also means that we are unable to analyze these devices for malware or Trojan horses. For example, the persistent memory inside GPUs, RAID controllers, network interfaces, and power-management co-processors is routinely ignored during forensic investigations, even though it can be utilized by attackers.

Today a 2TB hard drive can be purchased for $120 but takes more than 7 hours to image; systems and individuals of interest can easily have more storage than the police crime lab responsible for performing the analysis.
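
(A quick sanity check on that imaging time, assuming a sustained imaging rate of roughly 80 MB/s, which is about what 2010-era consumer drives could do: 2 × 10^12 bytes ÷ 8 × 10^7 bytes/s ≈ 25,000 s ≈ 7 hours.)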

No matter whether critical information is stored in an unidentified server “somewhere in the cloud” or stored on the subject’s hard drive inside a TrueCrypt volume, these technologies deny investigators access to the case data.

Cloud computing in particular may make it impossible to perform basic forensic steps of data preservation and isolation on systems of forensic interest.

RAM forensics can capture the current state of a machine in a way that is not possible using disk analysis alone. But RAM DF tools are dramatically more difficult to create than disk tools. Unlike information written to disk, which is stored with the intention that it will be read back in the future (possibly by a different program), information in RAM is only intended to be read by the running program. As a result there is less reason for programmers to document data structures or conserve data layout from one version of a program to another. Both factors greatly complicate the task of the tool developer, which increases tool cost and limits functionality.

Among digital forensics professionals, the best approach for solving the coverage problem is to buy one of every tool on the market. Clearly, this approach only works for well-funded organizations. Even though many professionals rely on open source tools, there is no recognized or funded clearing house for open source forensics software.

Training is a serious problem facing organizations that deliver forensic services. There is a lack of complex, realistic training data, which means that most classes are taught with either simplistic manufactured data or else live data. Live data cannot be shared between institutions, resulting in dramatically higher costs for the preparation of instructional material. As a result, many organizations report that it typically takes between one and two years of on-the-job training before a newly minted forensics examiner is proficient enough to lead an investigation.

In US v. Comprehensive Drug Testing (Comprehensive Drug Testing, Inc, 2009), the Court wrote dicta that ran counter to decades of digital forensics practice and has dramatically limited the scope of federal warrant searches.

today’s (well, 2010’s) challenges

There are two fundamental problems with the design of today’s computer forensic tools:

  • Today’s tools were designed to help examiners find specific pieces of evidence, not to assist in investigations.
  • Today’s tools were created for solving crimes committed against people where the evidence resides on a computer; they were not created to assist in solving typical crimes committed with computers or against computers.

Put crudely, today’s tools were created for solving child pornography cases, not computer hacking cases. They were created for finding evidence where the possession of evidence is the crime itself.

Evidence-oriented design has limited both the tools’ evolutionary path and the imagination of those guiding today’s research efforts:

  • The legitimate desire not to miss any potential evidence has caused developers to emphasize completeness without concern for speed. As a result, today there are few DF tools that can perform a useful five-minute analysis.
  • The objective of producing electronic documents that can be shown in court has stunted the development of forensic techniques that could operate on data that is not readily displayed. For example, despite the interest in residual data analysis, there are no commercially available tools that can perform useful operations on the second half of a JPEG file. Indeed, it was only in 2009 that academics showed it was even possible to display the second half of a JPEG file when the first half is missing (Sencar and Memon, 2009).
  • The perceived impermissibility of mixing evidence from one case with another has largely blocked the adoption of cross-drive analysis techniques (Garfinkel, 2006), even though cross-case searches for fingerprints and DNA evidence are now a vital law enforcement tool.

visibility, filter, report model

This model closely follows the tasks required for evidence-oriented design (Section 3.1 of the paper). For example, the model allows the analyst to search for a specific email address, but does not provide tools for extracting and prioritizing all email addresses that may be present. Because files are recovered before they are analyzed, certain kinds of forensic analysis are significantly more computationally expensive than they would be with other models. While some processes can be automated using scripting facilities, automation comes only at great expense and has had limited success. Finally, this model does not readily lend itself to parallel processing. As a result, ingest delays are increasing with each passing year.

difficulty of reverse engineering

Many of today’s DF engineering resources are dedicated to reverse engineering hardware and software artifacts that have been developed by the global IT economy and sold without restrictions into the marketplace. But despite the resources being expended, researchers lack a systematic approach to reverse engineering. There is no standard set of tools or procedure. There is little automation. As a result, each project is a stand-alone endeavor, and the results of one project generally cannot exchange data or high-level processing with other tools in today’s forensic kit.

Monolithic applications

There is a strong incentive among a few specific vendors to deploy their research results within the context of all-in-one forensic suites or applications.

Lost academic research

There are relatively few cases of academic research being successfully transitioned to end users:

  1. Academic researchers can distribute open source tools that can be directly used, but most end users lack the skills to download tools and use them.
  2. Academic researchers can license their technology to a vendor, which then either sells the technology directly or incorporates it into an existing tool. It is difficult to find an instance of this happening.
  3. Vendors can read and learn from academic papers, perhaps creating their own parallel implementations of the work presented. But after numerous discussions with vendors it has become clear that they are relatively uninformed regarding the current state of academic forensic research.

New research directions

Forensic data abstraction

Today there are only five widely used forensic data abstractions:

  • Disk images are archived and transferred as raw or EnCase E01 files.
  • Packet capture files in bpf (McCanne and Jacobson, 1993) format are used to distribute network intercepts.
  • Files are used to distribute documents and images.
  • File signatures are distributed as MD5 and SHA1 hashes (see the short sketch after this list).
  • Extracted Named Entities such as names, phone numbers, email addresses, credit card numbers, etc., are distributed as ASCII text files or, in some cases, Unicode files. Named entities are typically used for stop lists and watch lists.
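
Computing the MD5 and SHA1 signatures mentioned above is straightforward with Python’s standard hashlib module. A minimal sketch (the function name and chunk size are mine, not from any standard tool); hashing in fixed-size chunks keeps memory use flat even on very large images:

import hashlib

def file_signatures(path, chunk_size=1024 * 1024):
    """Return (MD5, SHA1) hex digests of a file, reading it in 1 MB chunks."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

# Any file path works; test.txt is reused from the strings example later on.
print(file_signatures("test.txt"))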

The DF community specifically needs to create a wide range of abstractions: standardized ways for thinking about, representing, and computing with information ranging from a few bytes to a person’s lifetime data production. For example:

  • Signature metrics for representing parts of files or entire files, including n-grams, piecewise hashes, and similarity metrics.
  • File metadata, e.g. Microsoft Office document properties, JPEG EXIF information, or geographical information.
  • File system metadata, e.g. timestamps, file ownership, and the physical location of files in a disk image.
  • Application profiles, e.g. the collection of files that make up an application, the Windows Registry or Macintosh plist information associated with an application, document signatures, and network traffic signatures.
  • User profiles, e.g. the tasks the user engages in, which applications the user runs, when the user runs them, and for what purpose.
  • Internet and social network information associated with the user, e.g. the collection of accounts that the user accesses, or the user’s Internet “imprint” or “footprint” (Garfinkel and Cox, 2009).

Modularization and composability

Similar to the lack of standardized data formats is the lack of a standardized architecture for forensic processing.

Alternative analysis models

Stream-based disk forensics

Stream-based disk forensics is clearly more important for hard drives than for SSDs, which have no moving head to “seek.” But even without a seek penalty, it may be computationally easier to scan the media from beginning to end than to make a first pass for file-by-file recovery followed by a second pass in which the unallocated sectors are examined.

Stochastic analysis

Another model for forensic processing is to sample and process randomly chosen sections of the drive. This approach has the advantage of potentially being very fast, but has the disadvantage that small pieces of trace data may be missed.

Prioritized analysis

Prioritized analysis is a triage-oriented approach in which forensic tasks are sequenced so that the operator will be presented with critical information as quickly as possible.

Scale and validation

Scale is an important issue to address early in the research process. Today many techniques that are developed and demonstrated on relatively small data sets (n < 100) fail when they are scaled up to real-world sizes (n > 10,000). This is true whether n refers to the number of JPEGs, TB, hard drives or cell phones.

Forensic researchers and tool developers need to hold themselves to a level of scientific testing and reproducibility that is worthy of the word “forensic.” New detection algorithms should be reported with a measurable error rate, ideally with both false positive and true positive rates reported. Many algorithms support one or more tunable parameters. In these cases the algorithms should be presented with receiver operating characteristic (ROC) curves graphing the true positive rate against the false positive rate (Fig. 1 in the paper) for a variety of parameter settings. Finally, consistent with the US Supreme Court’s Daubert ruling (Daubert v. Merrell Dow Pharmaceuticals, 1993), the research community should work to develop digital forensic techniques that produce reportable rates for error or certainty when they are run.
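
To make the ROC idea concrete, here is a minimal Python sketch (the scores and labels are made-up toy data, not from any real detector): sweep a decision threshold over a detector’s scores and record the true positive rate and false positive rate at each setting; plotting TPR against FPR traces out the ROC curve.

# Toy ROC computation: sweep a threshold over detector scores and
# record (false positive rate, true positive rate) pairs.
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.5]   # detector output per sample
labels = [0,   0,   1,    1,   1,    0,   1,   0]     # 1 = actually "of interest"

def roc_points(scores, labels):
    points = []
    for t in sorted(set(scores), reverse=True):        # each distinct score as a threshold
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tpr = tp / sum(labels)                          # true positive rate at this threshold
        fpr = fp / (len(labels) - sum(labels))          # false positive rate at this threshold
        points.append((fpr, tpr))
    return points

print(roc_points(scores, labels))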

Moving up the abstraction ladder

Given the ability to treat collections of data and metadata as self-contained objects and to treat advanced forensic processing across multiple drives and data streams as simple function calls, researchers will be able to move up the abstraction ladder.

Identity management

We need approaches for modeling individuals in a manner that is both principled and computable. Such an abstraction would include representations for simple data elements like names, email addresses and identification numbers, but should also extend to ways of formally representing a person’s knowledge, capabilities and social network.

Data visualization and advanced user interfaces

Current tools use the standard WIMP model (window/icon/menu/pointing device), which is poorly suited to presenting large amounts of forensic data in an efficient and intuitive way.

Visual analytics

Next-generation forensic tools need to integrate interactive visualization with automated analysis techniques, which will present data in new ways and allow investigators to interactively guide the investigation.

Collaboration

Since forensics is increasingly a team effort, forensic tools need to support collaboration as a first-class function.

Autonomous operation

New, advanced systems should be able to reason with and about forensic information in much the same way that analysts do today. Programs should be able to detect and present outliers and other data elements that seem out of place.

Carving

Let’s talk about file carving.

“Carving” is a generic term for extracting and assembling the “interesting” bits from a larger collection of bits.

Some of what we’re going to do will have a lot of overlap with lexers and parsers, which can be automatically generated; if you’ve taken a compilers course this will sound familiar to you, though we won’t generally be using those techniques here.

Carving text (ASCII) from files

Suppose we don’t know anything about a file or filesystem. In the long term, we might take the time to reverse engineer the file type from existing data, from source code we might have access to, or, worst case, by binary reverse engineering.

But in the short term, we might try to extract meaningful data from the file or filesystem image. The simplest form of data we might try to pull out is text. How can we do this? A naive algorithm is to read bytes sequentially, outputting each run of bytes that represents valid ASCII text. We might set a minimum length on the runs to help ensure we’re getting valid values, and not just random values.
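
Here is a minimal Python sketch of that naive algorithm (the function name ascii_runs and the min_len parameter are mine, not from any standard tool): walk the bytes once and emit every run of at least min_len printable ASCII bytes.

def ascii_runs(data, min_len=4):
    """Yield runs of printable ASCII (0x20-0x7E, plus tab) of length >= min_len."""
    run = bytearray()
    for b in data:
        if 0x20 <= b <= 0x7E or b == 0x09:   # printable ASCII or tab
            run.append(b)
        else:                                 # run broken by a non-printable byte
            if len(run) >= min_len:
                yield run.decode("ascii")
            run.clear()
    if len(run) >= min_len:                   # flush the final run
        yield run.decode("ascii")

# Usage: print the runs found in an arbitrary binary blob.
with open("test.txt", "rb") as f:
    for s in ascii_runs(f.read()):
        print(s)

Run on the test.txt file created in the strings example below, this prints the same two lines strings does, since the three-character “abc” falls below min_len.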

(Show the strings state machine.)

The strings utility is installed on most UNIX machines; by default it extracts from its input every ASCII string consisting of four or more printable characters. The version of strings installed determines some of the fiddly behavior, such as whether it only considers strings that are NUL- or newline-terminated.

If you run strings on a text file, then you just get the lines of that file that contain four or more characters:

# -e is to turn on escape characters ('\n') in my version of `echo`
> echo -e "Hello Marc\nabc\n\nGoodbye Marc" > test.txt  
# `cat` sends its input to standard output
> cat test.txt 
Hello Marc
abc

Goodbye Marc
> strings test.txt
Hello Marc
Goodbye Marc

The GNU version of strings (installed on my Mac as gstrings) allows you to search files not just for ASCII but for general Unicode in UTF-8, -16, or -32; for the last two, it lets you specify little- or big-endian encodings. These are selected with the -e option. (More on this topic in a bit.)

> man gstrings
> gstrings -h
Usage: gstrings [option(s)] [file(s)]
 Display printable strings in [file(s)] (stdin by default)
 The options are:
  -a - --all                Scan the entire file, not just the data section [default]
  -d --data                 Only scan the data sections in the file
  -f --print-file-name      Print the name of the file before each string
  -n --bytes=[number]       Locate & print any NUL-terminated sequence of at
  -<number>                   least [number] characters (default 4).
  -t --radix={o,d,x}        Print the location of the string in base 8, 10 or 16
  -w --include-all-whitespace Include all whitespace as valid string characters
  -o                        An alias for --radix=o
  -T --target=<BFDNAME>     Specify the binary file format
  -e --encoding={s,S,b,l,B,L} Select character size and endianness:
                            s = 7-bit, S = 8-bit, {b,l} = 16-bit, {B,L} = 32-bit
  -s --output-separator=<string> String used to separate strings in output.
  @<file>                   Read options from <file>
  -h --help                 Display this information
  -v -V --version           Print the program's version number
gstrings: supported targets: mach-o-x86-64 mach-o-i386 mach-o-le mach-o-be mach-o-fat pef pef-xlib sym plugin srec symbolsrec verilog tekhex binary ihex
Report bugs to <http://www.sourceware.org/bugzilla/>

Data validity for strings

How do we know strings extracted in this way are meaningful? In general, we don’t, though there might be a line of inductive reasoning that could apply.

A supporting line of evidence might be argued probabilistically. What are the odds that n sequential bytes are all ASCII? p^n, where p is the probability that each byte is printable ASCII. If you assume that each byte is generated IID and uniformly, p = 95/256 ≈ 0.37, so for n = 4 that’s about 0.019. (That’s kind of a weird assumption, unless your data source is a random number generator, though.)

That’s for a single sequence; what if you want to ask whether, by chance alone, we would find at least one run of n all-ASCII bytes somewhere in m bytes? There are m - n + 1 such runs. “At least one run” is the complement of “no runs.” If the runs were independent, the probability of no all-ASCII run would be (1 - p^n)^(m - n + 1), so consider 1 minus that quantity. Or should you? These runs are definitely not independent, since each successive run contains a fraction (n-1)/n of the previous run, in order! Ultimately, we can play probability games for whatever question you want to ask. It’s important, then, to note that tools like strings are best used to help generate hypotheses or to reconstruct unknown file formats, and not generally to (attempt to) carry a rigorous inductive argument on their own.
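
A quick back-of-the-envelope in Python, just to put numbers on the argument above (this uses the naive independence approximation, which we just argued is not really justified):

p = 95 / 256                 # chance a uniformly random byte is printable ASCII
n, m = 10, 4096              # run length of interest, total bytes examined

p_run = p ** n               # probability one specific n-byte run is all ASCII
runs = m - n + 1             # number of (overlapping) n-byte runs in m bytes

# Naive bound that pretends the overlapping runs are independent (they aren't):
p_at_least_one = 1 - (1 - p_run) ** runs
print(f"p^n = {p_run:.3e}, naive P(at least one run) = {p_at_least_one:.3e}")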

More complex formats

So, you can extend this model in a pretty straightforward way. What if you want to find, say, all HTML files in a given disk image? Then instead of looking for just ASCII characters, you look specifically for <HTML> and </HTML>. That will definitely capture some things, but it might also capture invalid things (depending upon fragmentation).
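
Here is a minimal sketch of that header/footer idea in Python (the lowercase tags, the size cap, and the disk.img path are my choices for illustration; real carvers handle case, whitespace, and fragmentation far more carefully): scan an image for <html> ... </html> spans and write each one out as a candidate file.

# Naive header/footer carver: find every <html> ... </html> span in an image.
# Assumes each file is stored contiguously, which fragmentation can easily break.
HEADER, FOOTER = b"<html", b"</html>"
MAX_SIZE = 10 * 1024 * 1024              # give up on a candidate after 10 MB

def carve_html(image):
    start = 0
    while (h := image.find(HEADER, start)) != -1:
        f = image.find(FOOTER, h, h + MAX_SIZE)
        if f != -1:
            yield image[h : f + len(FOOTER)]   # one candidate HTML document
            start = f + len(FOOTER)
        else:
            start = h + 1                      # no footer in range; keep scanning
    return

with open("disk.img", "rb") as fp:             # disk.img is a made-up example path
    for i, doc in enumerate(carve_html(fp.read())):
        with open(f"carved_{i}.html", "wb") as out:
            out.write(doc)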

How might it end up a mess? Fragmentation. When filesystems write to disk, they have to choose an allocation strategy. Files might be written contiguously, or might not (on board). If not, then things get…uglier.

How can we validate? Well, HTML is relatively easy to validate, in that you can make sure it’s text, and isn’t too badly out of compliance with the spec.

You can also think about how we might cut down on what to validate (or even carve!). For example, the smallest addressable unit on most devices is a sector, typically 512 or 4096 bytes. So we need only check for the “start tag” at sector boundaries.
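
A tiny sketch of that optimization (the sector size and header are parameters; 512 is just the common default): instead of testing every byte offset, only test offsets that fall on sector boundaries.

SECTOR = 512   # or 4096, depending on the device

def header_offsets(image, header=b"<html", sector=SECTOR):
    """Yield sector-aligned offsets where a candidate header begins."""
    for off in range(0, len(image), sector):
        if image[off : off + len(header)] == header:
            yield off

With 512-byte sectors this cuts the number of candidate offsets by a factor of 512; the trade-off is that it misses any file whose data does not start on a sector boundary (for example, a file embedded inside another file).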

More on this next class.