03: Forensic Science; Data representation

Announcements

The most important thing to know today: the course web site is at https://people.cs.umass.edu/~liberato/courses/2017-spring-compsci365/. It is the syllabus for this class and you are expected to read it in its entirety. (This announcement will go away after add/drop ends.)

Today we'll finish our motivating example, talk about why and how digital forensics can be a science, talk a little bit about data representation, and run through some practical Python stuff that will help you on the hexdump assignment that was posted Monday.

What makes forensics a science?

Throughout this class, we will present many techniques for recovering forensic evidence from computer systems. The skills you will learn can be applied in many different scenarios. For example, recovery of erased data is useful simply when data is deleted accidentally and there does not need to be a crime involved. Under what conditions is the practice of forensics a science rather than a series of related techniques that recover data?

Forensics is a science when the investigator follows a repeatable, structured process for gathering evidence and uses strong inductive reasoning to reach conclusions, as we explain below. Specifically, a scientific investigator takes three critical steps in investigations.

  1. The process begins when an investigator has judged that an alleged crime or other event is worth investigating.
  2. Next, the investigator gathers evidence.
  3. Finally, a hypothesis is supported that best explains the events that took place.

The first step is dictated by the resources available to the investigator and by the law’s definition of crime, civil law, or a company’s internal policy. For example, there are more crimes and criminals than law enforcement can handle and investigations often have to be prioritized.

Evidence gathered in Step 2 has its origin in the transfer of artifacts at a crime scene that is described by Locard's Exchange Principle. How evidence is gathered will greatly affect the validity of the results; however, this is a topic we leave for later discussions. The investigator attempts to refine artifacts into evidence in several stages: identification: determining an artifact’s class characteristics; individualization: narrowing the class to one; association: linking a person with a crime scene through the individualized evidence; intentionality: inferring the intent of the person. Each stage is challenging: not all artifacts can be found, and intent is often hard to discern.

In the last step, the investigator finds that a hypothesis can be supported by the evidence that answers questions specific to forensics: what crime, who did it, what was their intent, what were their specific actions? The type of reasoning employed by the investigators to answer these questions determines whether forensics is a science. In general, there are three options, but only the last option provides a strong argument.

Abductive Reasoning: the investigator reasons about the crime based on the most likely explanation. For a given scenario where it appears that a entails b, an abductive investigator assumes that given that a is true, then b is true. For example:

I observe that the document properties contain Acme’s name, and it is likely that such information was filled in by Microsoft Word automatically when the document was created at Acme; therefore, this document was created while Adams was at Acme.

A generous way of describing abductive reasoning is to say that it is an application of Occam’s Razor, which states that “Unless there is reason to believe otherwise, the simplest solution is the best.” A more accurate description is to say that abductive reasoning is unsound: we cannot (or at least, should not) assume ahead of time the point we are trying to prove. In reality, an investigation’s hypothesis often starts this way, but it is not a line of reasoning that is worth testifying over. A sound forensic method shares much in common with the scientific method in that observations confirm or refute a hypothesis that explains events.

Deductive Reasoning: the investigator reasons about the crime by constructing truths based on axiomatic assumptions. A deductive investigator assumes a general truth a, and derives b as a consequence. For example:

I assume that Microsoft Word always automatically fills in a document’s properties with the author’s personal information; therefore, we can deduce that this document was created while Adams was at Acme.

The problem with deductive reasoning is that we are at the mercy of our assumptions: if they are wrong, our conclusions may be wrong as well.

Inductive Reasoning: the investigator reasons about the crime by what is observed to be true independent of the case. Here, we assume it is true that b follows from a because we perceive that to be the case repeatedly. In our scenario, an inductive investigator makes the claim:

From my repeated experience, I hypothesize that Microsoft Word always automatically fills in a document’s properties with the author’s personal information; therefore, we can infer that this document is an instance of my hypothesis and was created while Adams was at Acme.

The limitation of inductive reasoning is that we are at the mercy of our observations: if they are not an accurate sample of all possible outcomes, then we risk inferring the wrong conclusion. The good news is that scientists know how to perform careful observations and to draw conclusions appropriately. It is important to note, however, that just because we have used inductive reasoning does not mean we are correct: the above hypothesis ignores the possibility that Adams used a version of the program installed by Acme, but after she left the company. Moreover, a third party can easily change the information.

Which of these three types of reasoning did you apply to formulate and justify a hypothesis of Adams’s alleged crime? Was your reasoning based on the presence of the serial number; in real life would you have verified that the particular camera stamped each photo with its serial number? Did you know that EXIF information is as easily modified as Word document properties?

Forensics is a science when inductive reasoning is used. Inductive reasoning is strongest when hypotheses are repeatedly verified by independent parties. Investigators rely heavily on validation studies that perform repeated and precise tests on equipment and software to determine what can be said with assurance about evidence. Validation reports are published by government agencies, such as the National Institute of Standards and Technology (NIST), and by industry and academic researchers in peer-reviewed journals and conference proceedings.

Finally, we note that the most conservative view of inductive reasoning is that the investigator’s theory can only be negative. That is, we gain knowledge only from theories that are provably false. For example, we can be sure that the theory that Adams created the document at some third company is false if she never set foot inside that company’s door. Any theory that Adams created the document at Acme is true only in the sense that we haven’t yet observed evidence that proves it false. This viewpoint is quite pessimistic! However, it is important to understand that more often than not, the complete facts are hidden from the investigator and although a hypothesis fits, it does not mean it is correct. That analysis is left for judge and jury and not the investigator.

Digital evidence is circumstantial

There are many advantages to digital evidence, but investigators, and courts, must realize there are many limitations. The primary limitation is that digital evidence is often circumstantial — it is indirect evidence of an event, and we can infer a fact from its presence. For example, in the Adams case, all our evidence was circumstantial. We did not use the content of the photo as direct evidence; we used information recorded in the EXIF tags to infer its origin. In contrast, direct evidence proves a fact without inference. Examples of direct evidence include photos, video, recorded sound, DNA, and human witnesses to an event.

Digital evidence is often modifiable. In the Adams case, we assumed that her copy of MS Word placed Acme’s information in the properties of the Word document; however, you should check for yourself that you are able to modify that evidence quite easily and save the document. The new document would have a new timestamp, but you can get around this problem by resetting your computer’s clock.
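
As a rough illustration of how easily a timestamp can be changed, here is a minimal Python sketch. It does not reset the clock as described above; instead it uses the standard library's os.utime to set a file's modification time directly, which is a different route to the same effect. The file name memo.txt and the one-year offset are made up for the example.

```python
import os
import tempfile
import time

# Work in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "memo.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("draft")

    # Pick a moment one year in the past and stamp the file with it.
    one_year_ago = time.time() - 365 * 24 * 60 * 60
    os.utime(path, (one_year_ago, one_year_ago))  # (access time, modification time)

    # The file system now reports the backdated modification time.
    reported = os.path.getmtime(path)
    backdated = abs(reported - one_year_ago) < 3  # allow for coarse timestamp granularity
    print(backdated)
```

Nothing in the file's own metadata records that the modification time was set by hand, which is one reason a timestamp alone should be treated as circumstantial.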

Many later homework assignments in this class will contain evidence generated by your instructor! While it was time consuming to create these assignments, it was not difficult, as the evidence is embedded automatically by programs in their normal course of operation. Regardless, it is important to realize that digital evidence should not be considered as absolute fact when it is found (as one might consider DNA evidence), but that does not make it weaker than other types of circumstantial evidence.

You might now be asking, is there any value at all to digital evidence then? The answer is yes.

First, most evidence at a crime scene is indirect evidence. For any crime for which there are not witnesses, the case must be circumstantial. Moreover, direct evidence from witnesses is not always reliable; people do not have perfect perception or memory and are often biased.

Second, indirect evidence can be strong if there is other corroborating evidence, a notion that Locard's Exchange Principle speaks to. For example, let’s say that it has been alleged that John committed a crime against Jane, and John claims to not know her at all. Investigators find that John’s Web browser has a history of pages he has visited recently, including the text and images from those pages. Jane’s public Web page is found in that history cache, and it is used as indirect evidence that he knew Jane before the crime took place. John’s browser will record when exactly John last visited Jane’s page, and such facts can be corroborated by examining the Web server that hosts Jane’s Web page. Furthermore, as we will discuss later in the semester, other logs at John’s Internet Service Provider may be able to confirm indirectly that his computer was connected to the Internet at the time the page was viewed. Logs from John’s email server may indicate he checked or sent email at the time when the Web page was retrieved; if he admits to keeping his account and password secret from others, then the email server logs indicate it was he at the keyboard at the time.

Third, circumstantial evidence can lead to direct evidence. Other stored pages in John’s Web browser history may lead investigators to John’s friend, who may confirm directly that John knew Jane. Moreover, when presented with indirect evidence, suspects may be persuaded to confess to a crime. As any law enforcement investigator will tell you, a confession is just as good as, or better than, pursuing a guilty verdict at a trial.

Investigations

Investigation is the core mechanism of digital forensics, but not all investigations involve an alleged crime. Generally, the techniques and processes presented in this class are applicable to five types of investigations.

  • Criminal law. When a state or federal law is allegedly violated, investigators must follow specific procedures for gathering evidence including the use of warrants or subpoenas. The collected evidence and analysis are destined for presentation in a court room.
  • Civil law. In scenarios where private wrongs (torts) are alleged and compensation is desired from one party by another, private investigators and attorneys may be hired to bring a case to civil court. Such cases commonly involve contract disputes, divorces, and other non-criminal issues. Juries in civil cases must only find a preponderance of evidence and only a majority of the jury must agree on a verdict.
  • Incident response. When unauthorized access to a computer system or a collection of data is alleged, the investigator’s work is typically to identify the technical mechanism or human action that was violated so that systems or processes can be repaired to prevent future incidents. Possibly, the results will be passed to a criminal or civil process.
  • Intelligence gathering. Here the goal for investigators is to use forensic techniques to gather information from systems and documents. Along with other processes for verification, corroboration, and validation, the raw information is converted to intelligence. Rarely will the results be used in criminal or civil proceedings.
  • Malicious activity. All the knowledge and skills that can be gained from the study of forensics can be used for malicious purposes to invade your privacy. It’s important to understand the information you might leave behind for an adversary to recover.

Criminal investigators have the most restrictions on their actions, and their results will come under the most scrutiny, whether as part of a prosecution or a defense. Civil investigations have a similarly high bar, although plaintiffs in civil hearings must present only a preponderance of evidence to a judge or jury. Incident response investigations are typically carried out by the owners of the systems or data that were violated and are not in the context of the laws that bind the actions of criminal and civil investigations. Although precision is also the goal of an incident response investigator, a best guess is more acceptable in this context. Finally, intelligence gathering has few constraints, and the results inform a policy or strategy by an organization or government. Malicious activity that seeks to invade someone’s privacy may obey no constraints, and is thus a potent threat.

Data representation

There is nothing either good or bad, but thinking makes it so.
- Hamlet

Bits are bits, and ultimately it's people who decide what they mean. The same sequence of bits can mean different things in different contexts, and different sequences can mean the same thing in others. Let's start with some examples outside of the CS context.

When we talk of numbers, we often conflate the concept of the number and the numeral, that is, the symbol we write to represent the number. But these are very different things. For example, we might represent the number three as:

  • 3 (the traditional, base 10 notation)
  • 11 (the base 2, that is, binary notation)
  • ||| (three scratch marks)
  • iii (or III) (Roman numerals)

Or, we might see the character "I" and, without more context, not know whether it's referring to "oneself" or the number one, or something else entirely.

So back to bits. Same story here, it turns out: we need some context to decide what a sequence of bits (or bytes) means.
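
To make that concrete, here is a small Python sketch that reads the same two bytes three different ways. The byte values and variable names are chosen only for illustration.

```python
# The same two bytes, interpreted three different ways.
raw = bytes([0x48, 0x69])

as_integers = list(raw)                      # each byte as its own unsigned integer
as_text = raw.decode("ascii")                # the bytes as ASCII characters
as_one_number = int.from_bytes(raw, "big")   # both bytes as one 16-bit integer,
                                             # most significant byte first

print(as_integers)    # [72, 105]
print(as_text)        # Hi
print(as_one_number)  # 18537
```

None of these readings is "the" meaning of the bytes; which one is right depends entirely on the context the bytes came from.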

We'll return to this topic in more detail later in the course, but here is a brief list of the sorts of things we need to worry about.

Numerical (integer) values

First, numerical integer data is typically stored as signed or unsigned. Let's look at bytes first. A byte is eight bits, so an integer stored in an unsigned byte is stored exactly how you'd expect, paralleling the way we write base-10 numbers. The "low-order" (rightmost) bit is the 2^0 ("ones") place; the "high-order" (leftmost) bit is the 2^7 ("one-hundred-twenty-eights") place. Unsigned values do not have a sign (positive or negative) and are treated as non-negative, so an unsigned byte can hold 0 through 255.
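
A short Python sketch of the place-value idea, using an arbitrary example bit string: summing the powers of two by hand gives the same answer as Python's built-in base-2 conversion.

```python
bits = "10000011"  # an example unsigned byte, high-order bit on the left

# Bit i (counting from the right, starting at 0) is worth 2**i.
by_hand = sum(2 ** i for i, bit in enumerate(reversed(bits)) if bit == "1")

# Python's built-in conversion from a base-2 string agrees.
built_in = int(bits, 2)

print(by_hand, built_in)        # 131 131
print(format(built_in, "08b"))  # 10000011
```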

Signed values need to spare a bit to maintain sign information. They are not stored as a 7-bit unsigned value with a 1-bit positive-or-negative bit. Instead, they are stored in a format called "two's complement". For an N-bit number x, the two's complement is equal to 2^N - x, and that bit pattern is how -x is stored. You can also compute it quickly by taking the ones' complement (flipping every bit, the "normal" definition of complement) and then adding one. We use this format because it allows us to use a normal ALU's adder to add positive and negative numbers without more gates.
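
Here is a minimal Python sketch of both ways of computing an 8-bit two's complement, plus a check that ordinary addition (wrapped to eight bits) does the right thing. The function names are made up for this example.

```python
N = 8  # width in bits

def twos_complement(value, bits=N):
    """Bit pattern (as an unsigned int) that stores -value: 2^N - value."""
    return (2 ** bits - value) % (2 ** bits)

def invert_and_add_one(value, bits=N):
    """Same result via the shortcut: flip every bit, then add one."""
    mask = 2 ** bits - 1
    return ((value ^ mask) + 1) & mask

pattern = twos_complement(5)
print(format(pattern, "08b"))            # 11111011
print(pattern == invert_and_add_one(5))  # True

# Reading that byte back as a signed value gives -5.
decoded = int.from_bytes(bytes([pattern]), "big", signed=True)
print(decoded)                           # -5

# A plain adder that keeps only the low 8 bits computes 5 + (-5) = 0.
print((5 + pattern) % 2 ** N)            # 0
```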

So that's single byte values. What about two byte (16 bit) values, or larger? They mostly work the same way. Mostly. There's one more complication: endianness.

The issue is as follows. No matter what, within a byte, we have an obvious low and high order bit. But within a multi-byte value, which is the low-order byte and which is the high-order byte?

If the high-order ("bigger") bytes come first in the left-to-right reading, we call it "big endian" (as in, big end first). Network data is usually big endian. If the low-order bytes come first, we call it "little endian". Intel x86 CPUs operate on data in little endian format. Why do some systems use one and some the other? Take an architecture course to find out; we're not going to cover it here.

Anyway, many but not all data formats explicitly specify the endianness of the bytes they store; others vary based upon the local CPU's architecture. Whenever you are examining binary data, you need to keep the endianness of the data in mind.
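
A quick Python sketch of the difference, using int.from_bytes and int.to_bytes from the standard library. The four example bytes are arbitrary.

```python
import sys

raw = bytes([0x12, 0x34, 0x56, 0x78])

# Same four bytes, two different integers depending on byte order.
big = int.from_bytes(raw, byteorder="big")
little = int.from_bytes(raw, byteorder="little")

print(hex(big))     # 0x12345678
print(hex(little))  # 0x78563412

# Going the other way: one integer, two different byte sequences.
print((0x12345678).to_bytes(4, "big").hex())     # 12345678
print((0x12345678).to_bytes(4, "little").hex())  # 78563412

# The byte order of the machine running this code ('little' or 'big').
print(sys.byteorder)
```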

In any case, two-byte integer values are in some programming languages called "short"s; four-byte values are usually the default type for ints, and eight-byte values are called longs.
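
One way to see these sizes from Python is the standard library's struct module. This is only a sketch: the format codes h, i, and q are struct's names for 2-, 4-, and 8-byte integers, and the "<" prefix asks for little-endian data with standard sizes regardless of the local machine.

```python
import struct

# Standard sizes for struct's 2-, 4-, and 8-byte integer codes.
for code in ("h", "i", "q"):
    print(code, struct.calcsize("<" + code))
# h 2
# i 4
# q 8

# The same two bytes read as signed (h) versus unsigned (H) 16-bit values.
signed_value = struct.unpack("<h", b"\xff\xff")[0]
unsigned_value = struct.unpack("<H", b"\xff\xff")[0]
print(signed_value, unsigned_value)   # -1 65535

# And the same two bytes read as little-endian versus big-endian.
little_value = struct.unpack("<H", b"\x01\x00")[0]
big_value = struct.unpack(">H", b"\x01\x00")[0]
print(little_value, big_value)        # 1 256
```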

String data

As you saw (or will see) in the second homework assignment, sometimes we choose to interpret a byte (or bytes) as characters. In Ye Olden Days of the United States, we could ignore the rest of the world, and encode all of our characters into a 7-bit code called ASCII (see https://en.wikipedia.org/wiki/ASCII#Code_chart).

Note that many characters in ASCII are actually control codes; some date back to controlling old teletypes (for example, why do we have a "carriage return" character?). Anyway, each ASCII value corresponds to something; some of those things are the traditional set of keys on a US keyboard. The hex value 0x41, for example, decodes to the integer 65, but under ASCII is also the code for an uppercase A. A major benefit of ASCII is that it encodes characters to bytes in a one-to-one (fixed-width) format, which makes programmers' lives easier in some minor ways, particularly in the olden days with C-style NUL-terminated strings.
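
In Python, the built-ins ord and chr convert between a character and its code, which makes the 0x41 example easy to check. The sample strings here are arbitrary.

```python
code = ord("A")
print(code)       # 65
print(hex(code))  # 0x41
print(chr(0x41))  # A

# Control codes are ordinary ASCII values too: 0x0d is carriage return,
# 0x0a is line feed.
decoded = bytes([0x41, 0x0D, 0x0A]).decode("ascii")
print(decoded == "A\r\n")  # True

# One character per byte: a 9-character string encodes to 9 bytes.
encoded_length = len("forensics".encode("ascii"))
print(encoded_length)      # 9
```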

There are many (many, many) ways to encode characters, and most are better than ASCII in that they permit the encoding of other symbols, characters, and so on. Mostly but not entirely, the world has settled on using Unicode to name each character and some non-characters like emoji (in particular by mapping each character or codepoint to a particular integer value), and on encoding these integers into bytes by using one of the defined "Unicode Transformation Formats" like UTF-8, which is The Best Such Format.

We'll talk more about this later, but in essence, UTF-8 is a variable-length encoding that is backwards-compatible with ASCII. The first 128 values (0 through 127) are ASCII and can be represented in one byte, just like ASCII. Characters with values greater than 127 are encoded into between two and four bytes. In short, if the first byte of a character has the high bit set, it's a multi-byte character that is decoded according to rules you can read about in the spec or on Wikipedia's UTF-8 page. Many older forensics tools only understand ASCII, but we'll deal with some of the complications of UTF later in the course.
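
A small Python sketch of the variable-length behavior: four sample characters (chosen arbitrarily, written as escapes so the source stays plain ASCII) encode to one, two, three, and four bytes respectively, and only the ASCII one has its high bit clear.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    high_bit_set = bool(encoded[0] & 0x80)  # True means a multi-byte character
    lengths.append(len(encoded))
    print("U+{:04X}".format(ord(ch)), len(encoded), encoded.hex(), high_bit_set)
# U+0041 1 41 False
# U+00E9 2 c3a9 True
# U+20AC 3 e282ac True
# U+1F600 4 f09f9880 True
```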

Some practical Python

A quick overview of some useful things to know about Python. (Most code examples below are from https://learnxinyminutes.com/docs/python3/ which is worth looking over in detail.)

You can start the python interpreter by typing python at the command line:

> python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

But notice on this machine, python defaults to Python2, not Python3. Press CTRL-D (or on Windows, CTRL-Z) to exit the interpreter, then if needed, use python3 or python3.5 to get the right version:

> python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

The interpreter runs in a REPL: a Read-Eval-Print Loop, which is excellent for exploratory programming. You type in an expression (it's "read"), the interpreter "eval"uates it (and updates state if necessary), then the result of the expression is "print"ed, and the process starts again.

# You have numbers
3  # => 3

# Math is what you would expect
1 + 1   # => 2
8 - 1   # => 7
10 * 2  # => 20
35 / 5  # => 7.0

# Equality is ==
1 == 1  # => True
2 == 1  # => False

# Inequality is !=
1 != 1  # => False
2 != 1  # => True

# (is vs. ==) is checks if two variables refer to the same object, but == checks
# if the objects pointed to have the same values.
a = [1, 2, 3, 4]  # Point a at a new list, [1, 2, 3, 4]
b = a             # Point b at what a is pointing to
b is a            # => True, a and b refer to the same object
b == a            # => True, a's and b's objects are equal
b = [1, 2, 3, 4]  # Point b at a new list, [1, 2, 3, 4]
b is a            # => False, a and b do not refer to the same object
b == a            # => True, a's and b's objects are equal

# Strings are created with " or '
"This is a string."
'This is also a string.'

# Strings can be added too! But try not to do this.
"Hello " + "world!"  # => "Hello world!"
# Strings can be added without using '+'
"Hello " "world!"    # => "Hello world!"

# A string can be treated like a list of characters
"This is a string"[0]  # => 'T'

# You can find the length of a string
len("This is a string")  # => 16

# .format can be used to format strings, like this:
"{} can be {}".format("Strings", "interpolated")  # => "Strings can be interpolated"

# You can repeat the formatting arguments to save some typing.
"{0} be nimble, {0} be quick, {0} jump over the {1}".format("Jack", "candle stick")
# => "Jack be nimble, Jack be quick, Jack jump over the candle stick"

# You can use keywords if you don't want to count.
"{name} wants to eat {food}".format(name="Bob", food="lasagna")  # => "Bob wants to eat lasagna"

# You can use the format specification mini-language to specially format some values (usually numerical). A pertinent example:

"{:08x}".format(1024)    # => "00000400"
"0x{:08x}".format(1024)  # => "0x00000400"

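A few more format specifications that tend to be handy when looking at raw bytes, as a hedged sketch (the sample values are arbitrary): x means lowercase hexadecimal, b means binary, a leading 0 pads with zeros, and the number is the minimum width.

```python
two_hex_digits = "{:02x}".format(255)  # 'ff'
eight_bits = "{:08b}".format(5)        # '00000101'
right_aligned = "{:>6}".format("ab")   # '    ab'

# Iterating over a bytes object yields integers, so each byte can be formatted.
spaced_hex = " ".join("{:02x}".format(b) for b in b"Hi!")

print(two_hex_digits)  # ff
print(eight_bits)      # 00000101
print(right_aligned)   #     ab
print(spaced_hex)      # 48 69 21
```
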
Let's look at an alternative interpreter you can install, IPython:

> ipython
Python 3.5.2 (default, Dec 17 2016, 06:22:44) 
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]:

IPython shows input and output and numbers them. While the default REPL is better than it used to be (it now does scrollback and tab-completion), IPython is still quite a bit better. Some quick things: Tab completion shows options, and appending ? to a name shows its documentation quickly.

# Python has a print function
print("I'm Python. Nice to meet you!")  # => I'm Python. Nice to meet you!

# By default the print function also prints out a newline at the end.
# Use the optional argument end to change the end character.
print("Hello, World", end="!")  # => Hello, World!

# No need to declare variables before assigning to them.
# Convention is to use lower_case_with_underscores
some_var = 5
some_var  # => 5

# Lists store sequences
li = []
# You can start with a prefilled list
other_li = [4, 5, 6]

# Add stuff to the end of a list with append
li.append(1)    # li is now [1]
li.append(2)    # li is now [1, 2]
li.append(4)    # li is now [1, 2, 4]
li.append(3)    # li is now [1, 2, 4, 3]

# You can look at ranges with slice syntax.
# (It's a closed/open range for you mathy types.)
li[1:3]   # => [2, 4]
# Omit the beginning
li[2:]    # => [4, 3]
# Omit the end
li[:3]    # => [1, 2, 4]
# Select every second entry
li[::2]   # => [1, 4]
# Return a reversed copy of the list
li[::-1]  # => [3, 4, 2, 1]
# Use any combination of these to make advanced slices
# li[start:end:step]

# Remove arbitrary elements from a list with "del"
del li[2]  # li is now [1, 2, 3]

# Tuples are like lists but are immutable.
tup = (1, 2, 3)
tup[0]      # => 1
tup[0] = 3  # Raises a TypeError


# Dictionaries store mappings
empty_dict = {}
# Here is a prefilled dictionary
filled_dict = {"one": 1, "two": 2, "three": 3}

# Look up values with []
filled_dict["one"]  # => 1

# Get all keys as an iterable with "keys()". We need to wrap the call in list()
# to turn it into a list. We'll talk about those later.  Note - Dictionary key
# ordering is not guaranteed. Your results might not match this exactly.
list(filled_dict.keys())  # => ["three", "two", "one"]


# Get all values as an iterable with "values()". Once again we need to wrap it
# in list() to get it out of the iterable. Note - Same as above regarding key
# ordering.
list(filled_dict.values())  # => [3, 2, 1]

# Check for existence of keys in a dictionary with "in"
"one" in filled_dict  # => True
1 in filled_dict      # => False

# Looking up a non-existing key is a KeyError
filled_dict["four"]  # KeyError

# Use "get()" method to avoid the KeyError
filled_dict.get("one")      # => 1
filled_dict.get("four")     # => None
# The get method supports a default argument when the value is missing
filled_dict.get("one", 4)   # => 1
filled_dict.get("four", 4)  # => 4

# Adding to a dictionary
filled_dict.update({"four":4})  # filled_dict is now {"one": 1, "two": 2, "three": 3, "four": 4}
#filled_dict["four"] = 4        #another way to add to dict

# Remove keys from a dictionary with del
del filled_dict["one"]  # Removes the key "one" from filled dict

Jupyter is a nice way to use Python interactively, too. It alternates between command mode and edit mode; in command mode there's a help / completion utility similar to the one in IPython. Start it from the command line using something like jupyter notebook; the exact command may vary depending upon how you installed it.

# Let's just make a variable
some_var = 5

# Here is an if statement. Indentation is significant in python!
# prints "some_var is smaller than 10"
if some_var > 10:
    print("some_var is totally bigger than 10.")
elif some_var < 10:    # This elif clause is optional.
    print("some_var is smaller than 10.")
else:                  # This is optional too.
    print("some_var is indeed 10.")


"""
For loops iterate over lists
prints:
    dog is a mammal
    cat is a mammal
    mouse is a mammal
"""
for animal in ["dog", "cat", "mouse"]:
    # You can use format() to interpolate formatted strings
    print("{} is a mammal".format(animal))

"""
"range(number)" returns an iterable of numbers
from zero to the given number
prints:
    0
    1
    2
    3
"""
for i in range(4):
    print(i)

# Use "def" to create new functions
def add(x, y):
    print("x is {} and y is {}".format(x, y))
    return x + y  # Return values with a return statement

# Calling functions with parameters
add(5, 6)  # => prints out "x is 5 and y is 6" and returns 11

# We can use list comprehensions for nice maps and filters
# List comprehension stores the output as a list which can itself be a nested list
def add_10(x):  # small helper so the example below runs on its own
    return x + 10

[add_10(i) for i in [1, 2, 3]]         # => [11, 12, 13]
[x for x in [3, 4, 5, 6, 7] if x > 5]  # => [6, 7]

# You can import modules
import math
print(math.sqrt(16))  # => 4.0

# You can get specific functions from a module
from math import ceil, floor
print(ceil(3.7))   # => 4  (in Python 3, ceil and floor return ints)
print(floor(3.7))  # => 3

# You can import all functions from a module.
# Warning: this is not recommended
from math import *

# You can shorten module names
import math as m
math.sqrt(16) == m.sqrt(16)  # => True

You might also consider a "real" IDE like PyCharm.

If you want to learn more, Dive Into Python is a pretty good resource.