01: Introduction

Welcome

Hello and welcome!

I’m Marc Liberatore liberato@cs.umass.edu and I’m your instructor for this course, COMPSCI 590F.

The most important thing to know today: the course web site is at http://people.cs.umass.edu/~liberato/courses/2019-spring-compsci590F/. It includes the syllabus for this class and you are expected to read it in its entirety.

If you are watching this on-line, great! Be sure to say hello on Piazza!

Who am I?

Not a “professor.” “Doctor” if you must, though I prefer just “Marc.” I am a member of the teaching faculty here at UMass.

Stuff I do:

  • Privacy research: my dissertation was about attacks on Tor, and ways to improve it
  • Forensics research: my research group builds tools and technologies to help law enforcement lawfully collect evidence of digital crime
  • Other research: location privacy in cell phones, bitcoin mixing, etc.
  • Teaching: I have taught various classes in this department: this course, 190D (now 186), the dread 187, crime law, AI, networking, etc.
  • NSF Cybercorps SFS: There are scholarships for students interested in doing cybersecurity for the government. I’ll talk more about this later.

Come by office hours (Monday 9:30–11:30 in CS 318) if you want to chat.

What is this course? / who is it for?

The goal of forensics is to gather artifacts for refinement into evidence that supports or refutes a hypothesis about an alleged crime or policy violation. Done correctly, forensics represents the application of science to law. The techniques can also be abused to thwart privacy.

To quote the course description: This course introduces students to the principal activities and state-of-the-art techniques involved in developing digital forensics systems. Topics covered may include: advanced file carving and reconstruction, forensic analysis of modern filesystems, network forensics, mobile device forensics, memory forensics, and anti-forensics.

What does this mean? It means we’re going to be reading research papers from the past 10–15 years or so in various areas of digital forensics and data recovery.

We’re going to be talking about their ideas critically in class, and you’re going to be working on problem sets, toy implementations, and small projects to help you explore the ideas they present. We’ll also use various open-source systems for forensics in various ways – we’ll use them, analyze them, and possibly extend them.

You’ll notice I’m writing on the board, not using PowerPoint. That’s how I roll. I will occasionally (depending upon topic, frequently) bust out the laptop for some livecoding and demos, and once in a while for illustrations, but there are no slides or PowerPoints available for this course. Come to class and take notes!

Prerequisites

For undergraduates: COMPSCI 365 or 377. CS majors only (others will need to request an override). Why? Because I expect you to be mostly familiar with the things that an introductory forensics or OS class would teach: low level details about OS operation, and filesystems in particular. You should also be familiar with the low-level details of data representation in these scenarios – we can generalize from your knowledge there as need be. Unlike 365, we’re not going to be spending too much time on law or politics. In this course, we’re going to be concerned mostly (though not exclusively) with technical issues.

Further, I expect a reasonable level of programming maturity. Operationally, that means I expect you know how to design, implement, and test small-to-medium sized programs, and I won’t be helping you debug your programs for assignments in this course. This semester, we have the luxury of low enrollment. So, I will be able to structure programming-related assignments to let you use whatever language you want, within reason. I have to be able to test things on my Mac laptop, so no Windows-only stuff, but otherwise do what you like.

Finally, I expect some mathematical maturity, commensurate with the median senior in a CS program. We’re not going to be doing deep proofs of complexity or whatnot, but you should not be totally unfamiliar with basic theoretical computer science, mathematics, and statistics.

Who are you?

Attendance.

If I didn’t call your name, you’re not enrolled. This probably means you either are on the wait list, or you want to be. Make sure you talk to me after class if this is the case!

An example of well-established forensic techniques

(Thanks to Brian Levine and Clay Shields for this motivating example.)

As a newly hired forensic investigator for Locard Forensics, Inc., you have been assigned to a team led by Mr. Locard, who tells you the following information about a case already in progress.

Anne Adams worked as a designer of toys at the Acme Toy Company for over ten years, eventually becoming a senior designer. One year ago, Nadir Toy Corp. offered Adams a position as vice-president of toy design, including a large pay raise, and she took the offer. This week, Acme learned of Nadir’s newest toy, which in their view had too much in common with a project Adams was seen working on before she left Acme: a toy rabbit. Mr. Locard has assigned you the task of verifying his hypothesis that Adams illegally copied documents describing the projects she worked on at Acme (documents owned by Acme) from her computer before she left.

Mr. Locard worked with lawyers to create a court-ordered subpoena that, under penalty of law, requires Adams to produce all her computers and storage devices. Your task as part of Locard’s team is to focus on her USB storage device. Mr. Locard has made an exact copy of the original USB device, a process called imaging. One of the advantages of digital evidence over traditional evidence is that exact copies can be made and analyzed without disturbing the original.

Presumably you know how to do a byte-for-byte copy of a drive; there are tools to do this, including the built-in dd on most Unix-compatible OSes.
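
A minimal sketch of what that might look like on a Unix-like system (the device name /dev/sdb is hypothetical; identify the right device first, and be careful never to write to the evidence device):

sudo dd if=/dev/sdb of=adams.dd bs=4096 conv=noerror,sync

Here conv=noerror,sync tells dd to keep going past read errors and to pad unreadable blocks, which is usually what you want when imaging possibly-flaky media.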

All of the data on Adams’ device is now contained in a file called adams.dd; we don’t need another USB storage device to examine hers. In this case, her USB key is only a container for the digital evidence and not the evidence itself. The evidence file contains all the data that was previously on the USB key.

Goals

A forensic investigation has several goals, depending on the context. Typically, the primary goals are to

  1. Determine if there is evidence that a crime, tort, or policy violation has been committed;
  2. Identify the related events and actions that occurred;
  3. And identify who might be responsible.

In many criminal investigations, the goal of the investigator may additionally include determining the motive and intent of the perpetrator, corroborating alibis of the innocent, and verifying statements of witnesses. Moreover, criminal investigators need to preserve a demonstrable link between the artifacts we find at a crime scene and our later presentation of the evidence in court.

Our focus is on digital evidence, and so we will not detail procedures for gathering other types of evidence. It’s rare that only digital evidence is collected from a scene. Crime scene investigation can involve gathering chemical, ballistic, or biological remains of a crime. If you are interested in these topics, Saferstein has written an excellent introductory book.

In our particular case, our goal is to locate evidence from the USB key data that demonstrates the toy rabbit was first designed by Adams while she was still employed by Acme.

The image

Feel free to download a copy of the acquired image from Ms. Adams’ USB storage device and follow along. As noted above, once we possess this identical image, we don’t need her device (or another one) to examine its contents: the adams.dd evidence file contains all the data that was previously on the USB key.

Along with the adams.dd file, we are also provided with a “hash” of the file:

4d56552f2c1125d1298c3f40d8072abc05428e897879056b1e851aa0451b1ad0

so that we can verify we have the same file as the person who acquired it from the device. What’s that?

A brief digression: hashing

How do we know the contents of the image file and the contents of the original USB key are the same? How do we know that our copy is not later corrupted? How can we tell two copies are the same?

It’s always possible to do a full, bit-for-bit comparison of two devices or files. But there are many times when it’s useful to have a small, “fingerprint”-like piece of information that uniquely (with some disclaimers) identifies a particular piece of data. To generate such a fingerprint, we use cryptographic hash functions, which generate a hash or digest. A particular hash function generates a (small) fixed-size output that is easy for a human to check. This process is sometimes also called checksumming. Plain checksumming guards against unintentional corruption; cryptographic hash functions also protect against deliberate, malicious attempts to tamper with the data.

As an aside, here are three other reasons cryptographic hash functions are useful to us in this regard:

  • First, it is highly improbable that two different pieces of data (inputs) will have the same output. With fixed-size outputs, it is of course possible for two different inputs to produce the same hash, but hash functions are typically designed so that this is unlikely, in part by having a large enough output space (say, 160 bits or more). Similarly, given a hash output, it’s hard to find an input that generates it.
  • Second, they are reasonably fast for modern computers to compute. (There are some hash functions that are designed to be slower for various reasons, but not the hash functions we’re referring to in this course, like SHA-2 and so on).
  • Finally, as previously mentioned, the output is fixed-size and relatively small. The input can be 1 KB or 1 TB, but the output is always the same size. For example, SHA-256 produces 256-bit outputs.

Commonly used older hash functions include MD5 (128 bits) and SHA-1 (160 bits); both are known to be potentially vulnerable to a malicious adversary (though some of the attack models are not likely to be realistic for disk images). SHA-256 (256 bits) is not known to be vulnerable to malicious attempts at finding collisions.

Note there are various tools you can use to generate hashes. Here’s what I used on my Mac; these tools were installed using MacPorts:

gmd5sum adams.dd 
48a6e3ebfc375bd2e8a966690bd2d6d2  adams.dd

shasum adams.dd
1701bc76af9ff88335887e217b16c11f7289ecfc  adams.dd

shasum -a 256 adams.dd
4d56552f2c1125d1298c3f40d8072abc05428e897879056b1e851aa0451b1ad0  adams.dd
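
If the person who acquired the image hands you the expected hash in a checksum file (here I’m assuming a file named adams.dd.sha256 containing a single line of the form “<hash>  adams.dd”), shasum can do the comparison for you:

shasum -a 256 -c adams.dd.sha256
adams.dd: OK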

The Sleuth Kit (TSK)

OK; we know we have the right file. How can we dig into it? Standard command-line tools won’t give us much meaningful information:

> file adams.dd 
adams.dd: DOS/MBR boot sector, code offset 0x3c+2, OEM-ID "BSD  4.4", sectors/cluster 2, root entries 512, sectors 10239 (volumes <=32 MB), sectors/FAT 20, sectors/track 32, heads 16, serial number 0x36c013ef, label: "ADAMS      ", FAT (16 bit)

…though some of this should be familiar to you from 377, and all of it if you took 365.

So we’ll turn to The Sleuth Kit, an open source forensic toolkit written by Brian Carrier. We’ll use the fls command to get a first look into the image:

fls adams.dd 
r/r 3:  ADAMS       (Volume Label Entry)
d/d 5:  images
r/r 7:  Designs.doc
v/v 163171: $MBR
v/v 163172: $FAT1
v/v 163173: $FAT2
d/d 163174: $OrphanFiles

(An aside: naming conventions in TSK. See: http://wiki.sleuthkit.org/index.php?title=TSK_Tool_Overview but in short, tools are prefixed to tell you what level of abstraction they work on: mm- tools work on volumes and partition tables, fs- on file system structures, blk- on data blocks, i- on metadata/inode structures, and f- on file names.)

What’s going on here? (See http://wiki.sleuthkit.org/index.php?title=Fls for more details.)

r and d refer to regular files and directories. v refers to virtual entries: frequently, certain artifacts are useful to investigators, and so TSK will give us virtual references to them in various contexts.

Files are stored somehow in a filesystem, and information about each file is stored in the filesystem’s metadata. In most filesystems, there are two parts to the metadata: the file name structure, and the rest of it (usually just called the metadata in TSK).

In the listing above, the first letter (r or d) is the type recorded in the file name structure; the second is the type recorded in the general metadata (last modified time, location on disk, etc.). Usually they’re equal; if a file has been deleted and one of the two structures reallocated, they can differ.

The metadata structure about each file is given a unique address by most filesystems; if there is no such address in a filesystem, TSK will create (in a deterministic fashion) an address. That’s the number listed here. This is often referred to as the inode, which is the name of the structure in many Unix-derived filesystems.

Files are typically stored in blocks on disk; each file may take up one or more blocks. In most file systems, each block is used by only one file, even if the file is smaller than the block size. In FAT, the volume label gets its own entry (this is part of how TSK abstracts the underlying disk / partition / filesystem structure; we’ll see more on this later).
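
If you want to see what’s actually recorded at one of these metadata addresses, TSK’s istat tool (the metadata-layer analogue of stat) will display it. For example, for the Designs.doc entry we saw above (the exact output depends on the filesystem; on FAT it includes the directory entry’s attributes, size, timestamps, and the sectors allocated to the file):

istat adams.dd 7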

Let’s say we want to extract the Designs.doc file from this disk image. We can use icat, which is similar to the cat utility, but instead of reading from standard input, it parses a disk image, finds the file associated with a given inode, and sends its contents to standard output. We can capture the output with the shell redirection operator >:

icat adams.dd 7 > Designs.doc

Let’s immediately take a fingerprint:

shasum -a 256 Designs.doc > Designs.doc.sha256

Notice we saved it to a file for later use, so that other investigators can verify that it is possible to extract the identical data from the image. Additionally, the hash allows the investigation to be verified in a way that is independent of the tools we are using.

Deleted files

Let’s now take a closer look at Adams’ files. Recall that there is a subdirectory. To see a list of all files, recursively traversing the entire file system and prepending the full path to each file, issue the fls -r -p command (-r for recursive, -p to show the full path).

fls -r -p adams.dd 
r/r 3:  ADAMS       (Volume Label Entry)
d/d 5:  images
r/r * 549:  images/_MG_3027.JPG
r/r 7:  Designs.doc
v/v 163171: $MBR
v/v 163172: $FAT1
v/v 163173: $FAT2
d/d 163174: $OrphanFiles

The command reveals another file, but this time, TSK puts an asterisk next to it, denoting that the file has been deleted. We can extract the file and fingerprint it using icat and shasum, just as we did above. From the directory entry at address 549, TSK knows where the file’s data began; however, because the file is deleted, information about the subsequent blocks in which the file was located could have been disturbed. Include the -r flag to let TSK make its best guess (use ‘r’ecovery techniques) at the complete file. Why can we do this? Because, as you know, most filesystems don’t overwrite deleted data; they just mark it as available, or free space. So as long as a new file hasn’t been “saved over” that space, the old file may still be recoverable.

icat -r adams.dd 549 > image.jpg
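
And, mirroring what we did for Designs.doc, let’s immediately take a fingerprint of the recovered file so the extraction can be verified later:

shasum -a 256 image.jpg > image.jpg.sha256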

We can now view the file.

Locard’s exchange principle

In this case, the alleged crime is property theft, a civil dispute, and the hypothesis was suggested to us by Acme: Adams copied proprietary documents and took them with her to the Nadir Toy Corp.

To build a case, we make use of a foundational principle of criminal investigations.

Locard’s Exchange Principle: Anyone and anything entering a scene can take something of the scene with them, and they can leave something of themselves behind.

When this principle is in effect, it produces two types of evidence:

  1. Class characteristics: traits common in similar items;
  2. Individual characteristics: unique traits that can be linked to a specific person or activity with greater certainty.

The classic example that distinguishes these two categories is shoe prints: the make and model of a shoe is a class characteristic, whereas the scuff marks identify a particular shoe. A file’s extension tells us that the file was probably created by Microsoft Word, a class characteristic; the meta-data stored in the file can tell us information about who created it, an individual characteristic. To overcome the limited view presented by a partial record of events, investigators will seek to present the largest possible set of corroborating evidence that supports a hypothesis; Locard’s principle is instrumental in that goal.

Locard’s principle explains why evidence may be found, but it does not mandate that evidence is always present. It is possible to have a crime scene with no evidence at all: data can be overwritten easily; fire may melt backup storage; firefighters may wash away fingerprints.

Application metadata

The Designs.doc file can be opened with OpenOffice or Microsoft Word. Opening it, we find that the contents describe the design of the toy that Adams later released at Nadir Toy Corp. Now examine the document properties by selecting the menu option File→Properties.... The document has data that suggests it was first created when Adams was at the Acme Toy Company.

This information is called application metadata. In the Designs.doc file, the metadata represents an individual characteristic that links the file’s creation and use to Acme. The file type — Microsoft Word document — is a class characteristic. As we will see throughout the class, most digital artifacts are leads that help us find stronger evidence or lead us to other witnesses. This metadata does not strongly implicate Adams. We do not know how such information was entered into the document. Perhaps she still uses a version of MS Word that was installed by Acme but wrote this document after she left the company. The best way to use this information is to look for the same document, or features of the document, among the files on Adams’ former desktop computer at Acme and in the backed-up archives of that machine.
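
An aside: you don’t have to open Word to see these properties. A tool like exiftool, which we’ll meet again in a moment, can usually dump the embedded properties of an old-style .doc directly; the exact fields vary by document, but typically include things like Title, Author, Company, and Create Date:

exiftool Designs.doc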

EXIF data

Photos taken with digital cameras store a wealth of metadata alongside the image data. One format for storing this extra information is the Exchangeable Image File Format (EXIF). You can reveal the EXIF data stored in an image, if any, using a number of tools. Most image viewer programs allow you to view the “properties” of an image. Several free tools are available to parse the EXIF data for you; one example is Phil Harvey’s exiftool, which parses much more than EXIF, it turns out.
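
Here’s an illustrative excerpt of what exiftool might report for the image we recovered earlier (trimmed, and exact tag names depend on the camera; the make, model, and serial number shown here are the ones that come up later in this case):

exiftool image.jpg
Make                 : Canon
Camera Model Name    : Canon PowerShot S500
Serial Number        : 4593234
...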

The EXIF information can record a time when the photo was taken and modified, a GPS position, and the make and model of the camera. Note again that such information is not sufficient or reliable for making a strong case. First, we do not know if the data is real; it may have been edited by a third party. Or, the data may be real, but the timestamp may not be accurate because the clock it references may have been set incorrectly. Similarly, we may not know whether a photo’s GPS location reflects where the picture was taken or merely where the camera last got a fix, say, when it was last powered on outside (GPS signals do not reach the inside of buildings).

What this information does tell us is that it may be a good idea to ask Adams if she has a Canon PowerShot S500. If she does, and it has serial number 4593234, then that is suggestive that her camera took this picture. At the same time, it does not prove Adams operated the camera when the picture was taken. If the camera has a flash card inserted in it when it is located, then we might find this same file on the flash card, which is stronger proof that a camera she owns took the photo. We might also verify that the clock is set correctly or learn details about how the camera handles GPS signals when inside buildings. Finally, a conversation with Adams about the camera will reveal more information: e.g., does she ever lend the camera to anyone? Where is the camera kept and who else has access to it?

In this case, let’s assume that a camera with a matching serial number was found to be the property of the Acme Toy Company and was in the possession of Adams (among others) while she was employed there.

Summing up

Can you form a hypothesis based on the evidence we recovered by applying Locard’s principle and other logical inferences? Let’s enumerate what we found out in our hypothetical investigation.

  1. Through a subpoena, we recovered Adams’ personal storage device. Let’s assume that Adams confirmed that the device is hers in an interview.
  2. Through an investigation, we discovered Acme’s document on an image of the device. The document contained metadata listing Acme’s name in the document properties. Locard’s principle and these individual characteristics suggest that Adams took this identifying information from Acme; however, we can’t be sure when she created the document: before or after she left the company.
  3. We recovered a deleted image from the storage device. Metadata in the EXIF portion of the image suggests the photo was taken with a camera with serial number 4593234 owned by Acme.

What might differ?

So, everything we just did should be old hat to you if you took 365, and even though the details might be obscure, if you’ve had an OS class, the broad strokes should make sense. But this is an “advanced forensics” course. Everything we just did was in keeping with how things worked during the “Golden Age” of digital forensics, which is coming to a close.

What’s going to be different, now and in the future? Lots! But in short:

File-specific carving and reassembly techniques: You can’t always neatly recover files. Sometimes you just get pieces – you’re missing headers or footers, or chunks in the middle. How can you reassemble them? And what do you do if critical information is missing? Text is easy: you can see what’s there and what’s not. But if the header of a JPEG or ZIP is missing, how do you decompress the data?

Streaming, sampling, and parallelism: Drives are gigantic now; just linearly reading an entire drive can take on the order of hours, and if you need to seek, etc., as is traditional in parsing, things can get a lot worse. Can we do better? Can we restructure parsing algorithms to work in a single pass? If we’re looking for specific evidence on large drives, can we sample small portions of the drive and probabilistically do well? If the parsing / etc. post-reading is CPU-bound, how best can we parallelize it to leverage the multicore world we live in?

Advanced filesystems: Most open-source and commercial tools support Windows (FAT and NTFS), and usually common Mac (HFS+) and Linux (ext3/4) filesystems. But lots has happened in the last 10–15 years in the filesystem world; most Unixes are moving away from inode-based filesystems and into pooled storage, like ZFS, btrfs, APFS, and so on. How do we handle these filesystems with their fundamentally different underlying structures?

Prioritized analysis: Golden age forensics says we should examine drives in their entirety; but increasingly there is just too much data to handle in reasonable timeframes. How can tools help investigators prioritize their attention? Single-pass filters? Interactive guidance towards files of interest?

Visualization: You just saw TSK’s output. There’s a web-based wrapper called Autopsy that’s marginally better. Part of the appeal of commercial tools is their point-and-click interface. But they still largely boil down to displaying lists of files and metadata. Can we do better? Interactive timelines? Apple-style “time machines”? Other approaches?

Cloud and IoT: What do we do when the data we are examining doesn’t live locally, but is instead diffused across virtual machines located at hard-to-localize data centers? What happens when evidence lives on small one-off devices, like smart bulbs, or drones, or cameras?

Mobile: Lots of “computing” is done on phones now. New OSes on phones bring new challenges: extracting data from non-standard interfaces, or new filesystems, or new applications.

Memory forensics: Sometimes the target of an investigation sets their system up to prevent investigations (e.g., full disk encryption). Forensic examiners have adapted by working on memory forensics – that is, examining the working memory of a running system. Memory is by design volatile, so decoding the structures and extracting relevant data in a principled way is harder than for filesystems.

Some administrivia

Some words about how assignments and grading will work in this course.

Summaries (15%)

There will be assigned reading for most class meetings. I expect you to read before class, and to write a short summary (a paragraph of roughly 4–5 sentences) of each assigned paper. I also want you to write down any questions or criticisms you have of the paper, so that I can address them during our meetings.

Assignments (50%)

The majority of the workload in this course will consist of take-home problem sets. These assignments will involve writing, programming, or both.

You will be allowed to work together on assignments, so long as you clearly indicate that you collaborated (and with whom). The goal here is to aid in your learning, not to have you swap completed problem sets, though!

We plan to give about six assignments, about one every two weeks.

Each assignment will contribute a stated number of points toward the “Assignments” portion of your course grade. Each assignment may be worth a different amount of points.

Assignments have a due date, clearly marked on the course web site. Late assignments will be penalized very heavily: 1/4 of the credit per day (or fraction thereof) late. Requests for extensions need to be made at least a day in advance. If you want to request an extension after a due date, I will expect a reasonable and well-documented excuse.

Midterms and exams (10/10/15%)

There will be two equally-weighted in-class midterms, dates TBA.

There will also be a cumulative final exam. You must achieve a passing grade on the final exam to pass the class.

You may not bring supplemental material to the midterms or final exam; that is, they are closed-book, and the use of notes, calculators, computers, phones, etc., is forbidden, unless otherwise explicitly stated.

Exams must be completed on your own: they are not collaborative!

Other things to note:

Lecture attendance is not optional (except inasmuch as it’s always optional; you’re adults so miss class if you need to). Get notes from a friend if you miss class. Or watch the videos.

We’re using Piazza for discussion.

At the start of the semester, I will permit laptops and the like in the classroom. If it becomes clear that they are being used for purposes not directly related to the class, I will ban them. It is unfair to distract other students with Facebook feeds, animated ads, and the like.

Regardless, I recommend taking notes by hand. Research suggests that students who take written notes in class significantly outperform students who use electronic devices to take notes.

Finally, note we might talk about topics never discussed in other CS contexts: murder, adult pornography, contraband (cases and images of child exploitation), etc. We will keep discussions at a high level. No slang, no denigration. Pretend you are at work. If you need to sit out one of these discussions, please do so, no questions asked; they’re generally only there to contextualize some of our work, and we won’t usually go into graphic detail. But some frank talk is unavoidable when discussing the motivation behind digital forensics.

End-of-class reminders

The course website is not yet up-to-date (sorry about that) but will be within 24 hours or so. Once it’s up…

  • Read the Syllabus on the course web site.

  • Assignments and their due dates will go up on the web site as they become available. The reading will generally be up by the class before; for next class, read Simson Garfinkel’s Digital Forensics Research: The Next 10 Years.

If you aren’t enrolled and want to be, make sure you talk to me before you leave.