03: Forensic Science; Data representation

Welcome

Announcements

The most important thing to know today: the course web site is at http://people.cs.umass.edu/~liberato/courses/2019-spring-compsci365/. (This announcement will go away after add/drop ends.)

Public health announcement: If you are obviously contagiously sick, please stay home. Especially if you have flu-like symptoms. Read the lecture notes online and/or get them from a friend.

Today we’ll finish our motivating example, talk about why and how digital forensics can be a science, talk a little bit about data representation, and run through some practical Python stuff that might help you on the hexdump assignment that will be posted shortly.

What makes forensics a science?

Throughout this class, we will present many techniques for recovering forensic evidence from computer systems. The skills you will learn can be applied in many different scenarios. For example, recovery of erased data is useful simply when data is deleted accidentally and there does not need to be a crime involved.

Under what conditions is the practice of forensics a science rather than a series of related techniques that recover data?

When the investigator follows a repeatable, structured process for gathering evidence and uses strong inductive reasoning to reach conclusions, as we explain below.

Specifically, a scientific forensic investigator takes three critical steps in an investigation.

The process begins when an investigator has judged that an alleged crime or other event is worth investigating.
Next, the investigator gathers evidence.
Finally, a hypothesis is supported that best explains the events that took place.

The first step is dictated by the resources available to the investigator and the law’s definition of crime, civil law, or a company’s internal policy. For example, there are more crimes and criminals than law enforcement can handle, and investigations often have to be prioritized.

Evidence gathered in Step 2 has its origin in the transfer of artifacts at a crime scene that is described by Locard’s Exchange Principle. How evidence is gathered will greatly affect the validity of the results; however, this is a topic we leave for later. The investigator attempts to refine artifacts into evidence in several stages: identification: determining an artifact’s class characteristics; individualization: narrowing the class to one; association: linking a person with a crime scene through the individualized evidence; intentionality: inferring the intent of the person. Each stage is challenging: for example, not all artifacts may be found, and intent is often hard to discern.

In the last step, the investigator finds that a hypothesis can be supported by the evidence that answers questions specific to forensics: what crime, who did it, what was their intent, what were their specific actions? The type of reasoning employed by the investigators to answer these questions determines whether forensics is a science. In general, there are three options, but only the last option provides a scientific argument.

Abductive Reasoning: the investigator reasons about the crime based on the most likely explanation. For a given scenario where it appears that a entails b, an abductive investigator assumes that given that a is true, then b is true. For example:

I observe that the document properties contain Acme’s name, and it is likely that such information was filled in by Microsoft Word automatically when the document was created at Acme; therefore, this document was created while Adams was at Acme.

A generous way of describing abductive reasoning is to say that it is an application of Occam’s Razor, which states that “Unless there is reason to believe otherwise, the simplest solution is the best.” A more accurate description is to say that abductive reasoning is unsound: we cannot (or at least, should not) assume ahead of time the point we are trying to prove. In reality, an investigation’s hypothesis often starts this way, but it is not a line of reasoning that is worth testifying over. A sound forensic method shares much in common with the scientific method in that observations confirm or refute a hypothesis that explains events.

Deductive Reasoning: the investigator reasons about the crime by constructing truths based on axiomatic assumptions. A deductive investigator assumes a general truth a, and derives b as a consequence. For example:

I assume that Microsoft Word always automatically fills in a document’s properties with the author’s personal information; therefore, we can deduce that this document was created while Adams was at Acme.

The problem with deductive reasoning is that we are at the mercy of our assumptions: if they are wrong, our conclusions may be wrong as well.

Inductive Reasoning: the investigator reasons about the crime by what is observed to be true independent of the case. Here, we assume it is true that b follows from a because we perceive that to be the case repeatedly. In our scenario, an inductive investigator makes the claim:

From my repeated experience, I hypothesize that Microsoft Word always automatically fills in a document’s properties with the author’s personal information; therefore, we can infer that this document is an instance of my hypothesis and was created while Adams was at Acme.

The limitation of inductive reasoning is that we are at the mercy of our observations: if they are not an accurate sample of all possible outcomes, then we risk inferring the wrong conclusion. The good news is that scientists know how to perform careful observations and to draw conclusions appropriately. It is important to note, however, that just because we have used inductive reasoning does not mean we are correct: the above hypothesis ignores the possibility that Adams used a version of the program installed by Acme, but after she left the company. Moreover, a third party can easily change the information.

Which of these three types of reasoning did you apply to formulate and justify a hypothesis of Adams’s alleged crime? Was your reasoning based on the presence of the serial number; in real life would you have verified that the particular camera stamped each photo with its serial number? Did you know that EXIF information is as easily modified as Word document properties (or any other unsigned digital data)?

Forensics is a science when inductive reasoning is used. Inductive reasoning is strongest when hypotheses are repeatedly verified by independent parties. Investigators rely heavily on validation studies that perform repeated and precise tests on equipment and software to determine what can be said with assurance about evidence. Validation reports are published by government agencies, such as the National Institute of Standards and Technology (NIST), and by industry and academic researchers in peer-reviewed journals and conference proceedings.

Finally, we note that the most conservative view of inductive reasoning is that the investigator’s theory can only be negative – much like the traditional scientific method where we can at best reject the null hypothesis.

That is, we gain knowledge only from theories that are provably false. For example, we can be sure that the theory that Adams created the document at some third company is false if she never set foot in that company’s door. Any theory that Adams created the document at Acme is true only in the sense that we haven’t yet observed evidence that proves it false. This viewpoint is quite pessimistic! However, it is important to understand that more often than not, the complete facts are hidden from the investigator, and although a hypothesis fits, that does not mean it is correct. That analysis is left for judge and jury and not the investigator.

Digital evidence is circumstantial

There are many advantages to digital evidence, but investigators, and courts, must realize there are many limitations. The primary limitation is that digital evidence is often circumstantial — it is indirect evidence of an event, and we can infer a fact from its presence. For example, in the Adams case, all our evidence was circumstantial. We did not use the content of the photo as direct evidence; we used information recorded in the EXIF tags to infer its origin. From a legal perspective, “direct” evidence is directly observable and speaks for itself – direct evidence proves a fact without inference. Examples of direct evidence include photos, video, recorded sound, DNA, and human witnesses to an event.

Digital evidence is often modifiable. In the Adams case, we assumed that her copy of MS Windows placed Acme’s information in the properties of the Word document; however, you should check for yourself that you are able to modify that evidence quite easily and save the document. The new document would have a new timestamp, but you can trivially get around this problem by resetting your computer’s clock, or less trivially by editing the file’s timestamp directly.

Many later homework assignments in this class will contain evidence generated by your instructor! While it was time consuming to create these assignments, it was not difficult, as the evidence is embedded automatically by programs in their normal course of operation. Regardless, it is important to realize that digital evidence should not be considered as absolute fact when it is found (as one might consider DNA evidence), but that does not make it weaker than other types of circumstantial evidence.

You might now be asking, is there any value at all to digital evidence then? The answer is yes.

First, most evidence at a crime scene is indirect evidence. For any crime for which there are not witnesses, the case must be circumstantial. Moreover, direct evidence from witnesses is not always reliable; people do not have perfect perception or memory and are often biased.

Second, indirect evidence can be strong if there is other corroborating evidence, a notion that Locard’s Exchange Principle speaks to. For example, let’s say that it has been alleged that John committed a crime against Jane, and John claims to not know her at all. Investigators find that John’s Web browser has a history of pages he has visited recently, including the text and images from those pages. Jane’s public Web page is found in that history cache, and it is used as indirect evidence that he knew Jane before the crime took place. John’s browser will record when exactly John last visited Jane’s page, and such facts can be corroborated by examining the Web server that hosts Jane’s Web page. Furthermore, as we will discuss later in the semester, other logs at John’s Internet Service Provider may be able to confirm indirectly that his computer was connected to the Internet at the time the page was viewed. Logs from John’s email server may indicate he checked or sent email at the time when the Web page was retrieved; if he admits to keeping his account and password secret from others, then the email server logs indicate it was he at the keyboard at the time.

Third, circumstantial evidence can lead to direct evidence. Other stored pages in John’s Web browser history may lead investigators to John’s friend, who may confirm directly that John knew Jane. Moreover, when presented with indirect evidence, suspects may be persuaded to confess to a crime. As any law enforcement investigator will tell you, a confession is just as good as or better than pursuing a guilty verdict at a trial.

Investigations

Investigation is the core mechanism of digital forensics, but not all investigations involve an alleged crime. Generally, the techniques and processes presented in this class are applicable to five types of investigations.

Criminal law. When a state or federal law is allegedly violated, investigators must follow specific procedures for gathering evidence, including the use of warrants or subpoenas. The collected evidence and analysis are destined for presentation in a courtroom.
Civil law. In scenarios where private wrongs (torts) are alleged and compensation is desired from one party by another, private investigators and attorneys may be hired to bring a case to civil court. Such cases commonly involve contract disputes, divorces, and other non-criminal issues. Juries in civil cases must only find a preponderance of evidence and only a majority of the jury must agree on a verdict.
Incident response. When unauthorized access to a computer system or a collection of data is alleged, the investigator’s work is typically to identify the technical mechanism or human action that was violated so that systems or processes can be repaired to prevent future incidents. Possibly, the results will be passed to a criminal or civil process.
Intelligence gathering. Here the goal for investigators is to use forensic techniques to gather information from systems and documents. Along with other processes for verification, corroboration, and validation, the raw information is converted to intelligence. Rarely will the results be used in criminal or civil proceedings.
Malicious activity. All the knowledge and skills that can be gained from the study of forensics can be used for malicious purposes to invade your privacy. It’s important to understand the information you might leave behind for an adversary to recover.

Criminal investigators have the most restrictions on their actions, and their results will come under the most scrutiny, whether as part of a prosecution or a defense. Civil investigations have a similarly high bar, although plaintiffs in civil hearings must present only a preponderance of evidence to a judge or jury. Incident response investigations are typically carried out by the owners of the systems or data that were violated and are not in the context of the laws that bind the actions of criminal and civil investigations. Although precision is also the goal of an investigator, a best guess is more acceptable in this context. Finally, intelligence gathering has few constraints, and the results inform a policy or strategy of an organization or government. Malicious activity that seeks to invade someone’s privacy may obey no constraints, and is thus a potent threat.

Data representation

There is nothing either good or bad, but thinking makes it so.
- Hamlet

Bits are bits, and ultimately it’s people who decide what they mean. The same sequence of bits can mean different things in different contexts, and different sequences can mean the same thing in others. Let’s start with some examples outside of the CS context.

When we talk of numbers, we often conflate the concept of the number and the numeral, that is, the symbol we write to represent the number. But these are very different things. For example, we might represent the number three as:

3 (the traditional, base 10 notation)
11 (the base 2, that is, binary notation)
||| (three scratch marks)
iii (or III) (roman numerals)

Or, we might see the character “I” and, without more context, not know whether it’s referring to “oneself” or the number one, or something else entirely.

So back to bits. Same story here, it turns out: we need some context to decide what a sequence of bits (or bytes) means.

We’ll return to this topic in more detail later in the course, but here is a brief list of the sorts of things we need to worry about.

Numerical (integer) values

First, numerical integer data is typically stored as signed or unsigned. Let’s look at bytes first. A byte is eight bits, so an integer stored in an unsigned byte is stored exactly how you’d expect, paralleling the way we write base-10 numbers. The “low-order” (rightmost) bit is the 2^0 (“ones”) place; the “high-order” (leftmost) bit is the 2^7 (“one-twenty-eights”) place. Unsigned values do not carry a sign (positive or negative) and are treated as non-negative, so an unsigned byte can hold values from 0 to 255.

Signed values need to spare a bit to maintain sign information. They are not stored as a 7-bit unsigned value with a 1-bit positive-or-negative bit. Instead, they are stored in a format called “two’s complement”. For an N-bit number, the two’s complement is equal to 2^N - the number. You can also compute it quickly by taking the ones’ complement (the “normal” definition of complement) and then adding one. We use this format because it allows us to use a normal ALU’s adder to add positive and negative numbers without more gates.
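Here is a minimal sketch of the two computations just described, for 8-bit values. The function names (twos_complement, invert_and_add_one, as_signed) are made up for illustration; they are not part of any library.

```python
def twos_complement(value, bits=8):
    """Bit pattern that stores -value in `bits` bits, computed as 2^N - value."""
    return (2 ** bits - value) % (2 ** bits)


def invert_and_add_one(value, bits=8):
    """Same result via the shortcut: flip every bit (ones' complement), add one."""
    mask = 2 ** bits - 1          # 0b11111111 for 8 bits
    return ((value ^ mask) + 1) & mask


def as_signed(pattern, bits=8):
    """Interpret an unsigned bit pattern as a signed two's complement value."""
    if pattern >= 2 ** (bits - 1):    # high bit set means negative
        return pattern - 2 ** bits
    return pattern


pattern = twos_complement(3)
print(format(pattern, "08b"))            # 11111101
print(pattern == invert_and_add_one(3))  # True
print(as_signed(pattern))                # -3

# Plain unsigned addition, truncated to 8 bits, gives the right signed answer:
print((5 + pattern) % 256)               # 2, that is, 5 + (-3)
```

The last line is the point of the format: the adder never needs to know whether the byte 11111101 "means" 253 or -3.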

So that’s single byte values. What about two byte (16 bit) values, or larger? They mostly work the same way. Mostly. There’s one more complication: endianness.

The issue is as follows. No matter what, within a byte, we have an obvious low and high order bit. But within a multi-byte value, which is the low-order byte and which is the high-order byte?

If the high-order (“bigger”) bytes come first in the left-to-right reading, we call it “big endian” (as in, big end first). Network data is usually big endian. If the low-order bytes come first, we call it “little endian”. Intel x86 CPUs operate on data in little endian format. Why do some systems use one and some the other? Take an architecture course to find out; we’re not going to cover it here.

Anyway, many but not all data formats specify the endianness of bytes they store explicitly, but some vary based upon the local CPU’s architecture – whenever you are examining binary data, you need to keep the endianness of the data in mind.

In any case, two-byte integer values are in some programming languages called “short”s; four-byte values are usually the default type for ints, and eight-byte values are called longs.
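To make endianness concrete, here is a short sketch using Python's built-in int.from_bytes and the standard struct module. The two example bytes are arbitrary; the point is that the same bytes decode to different integers depending on the byte order you assume.

```python
import struct

raw = bytes([0x01, 0x02])   # two bytes as they might appear in a file

# Same bytes, two interpretations:
big = int.from_bytes(raw, byteorder="big")        # 0x0102
little = int.from_bytes(raw, byteorder="little")  # 0x0201
print(big, little)                                # 258 513

# The struct module does the same thing with format codes:
# ">" is big endian, "<" is little endian, "H" is an unsigned 2-byte value.
print(struct.unpack(">H", raw)[0])                # 258
print(struct.unpack("<H", raw)[0])                # 513

# Going the other way: an integer to its bytes in a chosen order.
print((258).to_bytes(2, byteorder="little"))      # b'\x02\x01'

# With an explicit byte order, struct uses fixed sizes for
# short ("h"), int ("i"), and long long ("q").
print([struct.calcsize("<" + code) for code in "hiq"])  # [2, 4, 8]
```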

How do we display these values succinctly? Representing them as strings of 0s and 1s is unwieldy and hard to manage. Instead, we typically use something called hexadecimal notation. In “hex”, as it’s sometimes called, we represent each byte (8 bits) as two hexadecimal digits. Hex digits run from 0 to 9, then from a to f for the next six values, for a total of 16 possible values. 16 values can encode 4 bits, so we need two such digits to encode 8 bits.

For example, the value 150 fits in one unsigned byte (it’s less than 256). In bits, it is:

10010110

(128 + 16 + 4 + 2), or you can ask python (bin(150)).

To convert to hex, take the first four bits and interpret them as a value: 1001 in base 2 is 8 + 1 = 9. Then the next four: 0110 = 4 + 2 = 6. So 150 in decimal is 96 in hex. Usually we prefix hex values with 0x, so we’d write it as 0x96 (and if you type this into a python interpreter, it will give you back the value 150, since numbers are displayed by default in base 10).
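The same conversion can be checked in the interpreter. This is just a sketch of the by-hand steps above; the variable names are arbitrary.

```python
value = 150

bits = format(value, "08b")        # zero-padded to 8 binary digits
print(bits)                        # 10010110

# Split into two 4-bit halves and convert each half to one hex digit.
high, low = bits[:4], bits[4:]
high_digit = format(int(high, 2), "x")
low_digit = format(int(low, 2), "x")
print(high, "->", high_digit)      # 1001 -> 9
print(low, "->", low_digit)        # 0110 -> 6

# Python's built-ins agree:
print(bin(value))                  # 0b10010110
print(hex(value))                  # 0x96
print(0x96)                        # 150
```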

String data

As you saw (or will see) in the second homework assignment, sometimes we choose to interpret a byte (or bytes) as characters. In Ye Olden Days of the United States, we could ignore the rest of the world, and encode all of our characters into a 7-bit code called ASCII (see https://en.wikipedia.org/wiki/ASCII#Code_chart).

Note that many characters in ASCII are actually control codes; some date back to controlling old teletypes (for example, why do we have a “carriage return” character?). Anyway, each ASCII value corresponds to something; some of those things are the traditional set of keys on a US keyboard. The hex value 0x41, for example, decodes to the integer 65, but under ASCII is also the code for an uppercase A. A major benefit of ASCII is that it encodes characters to bytes in a one-to-one (fixed-width) format, which makes programmers’ lives easier in some minor ways, particularly in the olden days with C-style NUL-terminated strings.
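A quick way to poke at the ASCII table from Python is with the built-ins chr and ord, plus bytes.decode. The sample bytes below are made up for illustration.

```python
# One value, two readings: an integer, or (under ASCII) a character.
print(0x41)                      # 65
print(chr(0x41))                 # A
print(ord("A"))                  # 65

# Decoding a run of bytes as ASCII: one byte per character.
data = bytes([0x48, 0x69, 0x21])
text = data.decode("ascii")
print(text)                      # Hi!
print(len(data) == len(text))    # True

# Control codes are characters too; 0x0d is the carriage return.
print(repr(chr(0x0D)))           # '\r'
```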

There are many (many, many) ways to encode characters, and most are better than ASCII in that they permit the encoding of other symbols, characters, and so on. Mostly but not entirely, the world has settled on using Unicode to name each character, as well as some non-characters like emoji (in particular by mapping each character or codepoint to a particular integer value), and on encoding these integers into bytes by using one of the defined “Unicode Transformation Formats” like UTF-8, which is The Best Such Format.

We’ll talk more about this later, but in essence, UTF-8 is a variable-length encoding that is backwards-compatible with ASCII. The first 128 values (0 through 127) are ASCII and can be represented in one byte, just like ASCII. Characters with values greater than 127 are encoded into between two and four bytes. In short, if the first byte of a character has the high bit set, it’s a multi-byte character that is decoded according to rules you can read about in the spec or on Wikipedia’s UTF-8 page. Many older forensics tools only understand ASCII, but we’ll deal with some of the complications of UTF later in the course.
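As a rough illustration of the variable-length behavior, the sketch below encodes four sample characters with str.encode. The helper name utf8_info and the choice of characters are assumptions made for this example; the characters are written as escapes so the code itself is plain ASCII.

```python
def utf8_info(ch):
    """Return (code point, UTF-8 bytes as hex, whether the first byte's high bit is set)."""
    encoded = ch.encode("utf-8")
    high_bit_set = bool(encoded[0] & 0x80)
    return ord(ch), encoded.hex(), high_bit_set


samples = [
    "A",            # U+0041, plain ASCII
    "\u00e9",       # U+00E9, e with acute accent
    "\u20ac",       # U+20AC, euro sign
    "\U0001F600",   # U+1F600, an emoji
]

for ch in samples:
    codepoint, hex_bytes, high_bit_set = utf8_info(ch)
    print(hex(codepoint), hex_bytes, len(hex_bytes) // 2, high_bit_set)
# 0x41 41 1 False
# 0xe9 c3a9 2 True
# 0x20ac e282ac 3 True
# 0x1f600 f09f9880 4 True

# An ASCII-only decoder chokes on the multi-byte sequences:
try:
    bytes.fromhex("c3a9").decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)   # False
```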

`hexdump`

Bringing these two ideas together: Sometimes we want to examine the underlying data in a file, like our recovered Design.doc file or maybe a jpeg. The canonical way to do this is through a utility called hexdump, which displays the contents of a file. Here’s a quick example (demo). You’re going to reimplement this program in python for assignment 2.

Some practical python

A quick overview of some useful things to know about Python. (Most code examples below are from https://learnxinyminutes.com/docs/python3/ which is worth looking over in detail.)

You can start the python interpreter by typing python at the command line:

> python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

But notice on this machine, python defaults to Python2, not Python3. Press CTRL-D (or on Windows, CTRL-Z) to exit the interpreter, then if needed, use python3 or python3.5 to get the right version:

> python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

The interpreter runs in a REPL: a Read-Eval-Print Loop, which is excellent for exploratory programming. You type in an expression (it’s “read”), the interpreter “eval”uates it (and updates state if necessary), then the result of the expression is “print”ed, and the process starts again.

Jupyter is a nice way to use Python interactively, too. It alternates between command mode and edit mode; in command mode there’s a help / completion utility similar to the one in IPython. Start it from the command line using something like jupyter notebook; the exact command may vary depending upon how you installed it.

You might also consider a “real” IDE like PyCharm.