02: Motivating Example

Welcome

Announcements

Again, the most important thing to know today: the course web site is at http://people.cs.umass.edu/~liberato/courses/2018-spring-compsci365+590f/. It has the syllabus for this class and you are expected to read it in its entirety.

Edlab login details (in-class only).

Questions: Feel free to interrupt me. One of the benefits of coming to lecture is you can ask questions. If your question is too specific or off-topic, I’ll answer it after class or by Piazza (which you’ll need to remind me about).

Gradescope: You must consent in order to use it. There will be a Piazza poll posted soon.

If you have an accommodation due to a recognized disability, you need to contact me to tell me if and how you plan to use it. In particular, things like extra time on homeworks need to be requested in advance of, not after, the due date. Or if you need extra time on an exam, I need time to figure out how and where you’ll take it.

Continuing on with Anne Adams

Today we’ll continue our motivating example.

(Thanks to Brian Levine and Clay Shields for this motivating example.)

Recall we are investigating Anne Adams, who may or may not have been responsible for a toy design being taken from Acme to Nadir Toys when she was poached.

In our particular case, our goal is to locate evidence (if it exists) on a USB key that demonstrates the toy rabbit was first designed by Adams while she was still employed by Acme.

Later on, we’ll explain the details of how data is imaged from a storage device, including internal hard drives, USB storage, CDROMs, and more. Feel free to download a copy of the acquired image from Ms. Adams’ USB storage device and follow along. In any case, all of the data on Adams’ device is now contained in a file called adams.dd; we don’t need her device (or another) to examine hers once we possess this identical image. In this case, her USB key is only a container for the digital evidence and not the evidence itself. The evidence file contains all the data that was previously on the USB key.

Along with the adams.dd file, we are also provided with a “hash” of the file:

4d56552f2c1125d1298c3f40d8072abc05428e897879056b1e851aa0451b1ad0

so that we can verify we have the same file as the person who acquired it from the device. What’s that?

A brief digression: hashing

How do we know the contents of the image file and the contents of the original USB key are the same? How do we know that our copy is not later corrupted? How can we tell two copies are the same?

It’s always possible to do a full, bit-for-bit comparison of two devices or files. But there are many times when it’s useful to have a small “fingerprint”-like piece of information that uniquely (with some disclaimers) identifies a particular piece of data. To generate such a fingerprint, we use cryptographic hash functions, which generate a hash or digest. A particular hash function will generate a (small) fixed-size output, that is easy for a human to check. This process is sometimes also called checksumming. Checksumming checks against unintentional corruption. Cryptographic hash functions also protect against deliberate, malicious attempts to tamper with the data.

There are three other reasons cryptographic hash functions are useful to us in this regard:

  • First, it is highly improbable that two pieces of data (inputs) will have the same output. Note that with fixed-size outputs, it’s possible that two different inputs will produce the same hash; but hash functions are typically designed so that it’s unlikely; part of this is that they have a large enough output space (say, 160 bits). Similarly, given a hash output, it’s hard to find an input that generates it.
  • Second, they are reasonably fast for modern computers to compute. (There are some hash functions that are designed to be slower for various reasons, but not the hash functions we’re referring to in this course, like SHA-2 and so on).
  • Finally, as previously mentioned, the output is fixed size and relatively small. The input can be 1 KB or 1 TB, but the output is always the same size. For example, SHA-256 (a member of the SHA-2 family) produces 256-bit outputs.

Commonly used older hash functions include MD5 (128 bits) and SHA-1 (160 bits); practical collision attacks have been demonstrated against both, so neither should be relied on against a malicious adversary (though some of the attack models are not likely to be realistic for disk images). SHA-256, the 256-bit member of the SHA-2 family, is not known to be vulnerable to malicious attempts at finding collisions.
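To make the "fixed-size fingerprint" idea concrete, here is a minimal sketch using Python's standard hashlib module. The helper name digest_bits and the sample inputs are my own, chosen only for illustration; the point is that the output length depends on the hash function and not on the input.

```python
import hashlib


def digest_bits(algorithm: str) -> int:
    """Return the output size, in bits, of a named hash function."""
    return hashlib.new(algorithm).digest_size * 8


# Two made-up inputs that differ in a single character.
first = b"toy rabbit design, revision 1"
second = b"toy rabbit design, revision 2"

for name in ("md5", "sha1", "sha256"):
    print(f"{name} ({digest_bits(name)} bits)")
    print("  ", hashlib.new(name, first).hexdigest())
    print("  ", hashlib.new(name, second).hexdigest())

# The digest length does not change with the size of the input.
small = hashlib.sha256(b"x").hexdigest()
large = hashlib.sha256(b"x" * 1_000_000).hexdigest()
print(len(small), len(large))
```

Running it shows that a one-character change in the input produces a completely different digest, which is what makes the digest useful as a fingerprint.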

Note there are various tools you can use to generate hashes. Here’s what I used on my Mac; these tools were installed using MacPorts:

gmd5sum adams.dd 
48a6e3ebfc375bd2e8a966690bd2d6d2  adams.dd

shasum adams.dd
1701bc76af9ff88335887e217b16c11f7289ecfc  adams.dd

shasum -a 256 adams.dd
4d56552f2c1125d1298c3f40d8072abc05428e897879056b1e851aa0451b1ad0  adams.dd
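If you don't have these command-line tools handy, the same check can be sketched in Python. This is one possible approach and not the tool I used above; the function names hash_file and matches are hypothetical. The file is read in chunks so that a large image never has to fit in memory.

```python
import hashlib

# The SHA-256 value supplied with the image, copied from the notes above.
EXPECTED_SHA256 = "4d56552f2c1125d1298c3f40d8072abc05428e897879056b1e851aa0451b1ad0"


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it one chunk at a time."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """True if the file's digest equals the expected hex string."""
    return hash_file(path, algorithm) == expected_hex.strip().lower()
```

With adams.dd in the current directory, matches("adams.dd", EXPECTED_SHA256) should return True if your copy is identical to the acquired image.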

The Sleuth Kit (tsk)

OK; we know we have the right file. How can we dig into it? Standard command-line tools won’t give us much meaningful information:

file adams.dd
adams.dd: DOS/MBR boot sector, code offset 0x3c+2, OEM-ID "BSD  4.4", sectors/cluster 2, root entries 512, sectors 10239 (volumes <=32 MB), sectors/FAT 20, sectors/track 32, heads 16, serial number 0x36c013ef, label: "ADAMS      ", FAT (16 bit)

…though we’ll know how to read most of this by mid-semester.
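As a preview, here is a rough sketch of where some of those values come from. It assumes the image begins with a FAT12/16 boot sector (which is what the file output suggests) and uses the standard field offsets for that layout; the function name parse_fat16_boot_sector and the dictionary keys are my own.

```python
import struct


def parse_fat16_boot_sector(sector: bytes) -> dict:
    """Decode a few fields from the first 512 bytes of a FAT12/16 volume."""
    if len(sector) < 512:
        raise ValueError("need at least 512 bytes")
    if sector[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55AA boot signature")

    # Fields at offsets 3..35, little-endian, no padding.
    (oem, bytes_per_sector, sectors_per_cluster, reserved_sectors, num_fats,
     root_entries, total16, media, sectors_per_fat, sectors_per_track,
     heads, hidden_sectors, total32) = struct.unpack_from(
        "<8sHBHBHHBHHHII", sector, 3)

    # FAT12/16 extended fields at offsets 36..61.
    (drive, _reserved, ext_sig, serial, label, fs_type) = struct.unpack_from(
        "<BBBI11s8s", sector, 36)

    return {
        "oem_id": oem.decode("ascii", "replace"),
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "root_entries": root_entries,
        # The 16-bit count is zero when the volume needs the 32-bit field.
        "total_sectors": total16 or total32,
        "sectors_per_fat": sectors_per_fat,
        "sectors_per_track": sectors_per_track,
        "heads": heads,
        "serial_number": serial,
        "volume_label": label.decode("ascii", "replace").rstrip(),
        "fs_type": fs_type.decode("ascii", "replace").rstrip(),
    }
```

Calling it on the first 512 bytes of the image, for example parse_fat16_boot_sector(open("adams.dd", "rb").read(512)), should yield values you can compare against the file output above.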

So we’ll turn to The Sleuth Kit, an open source forensic toolkit written by our textbook author. We’ll use the fls command to get a first look into the image:

fls adams.dd 
r/r 3:  ADAMS       (Volume Label Entry)
d/d 5:  images
r/r 7:  Designs.doc
v/v 163171: $MBR
v/v 163172: $FAT1
v/v 163173: $FAT2
d/d 163174: $OrphanFiles

(An aside: naming conventions in TSK. See: http://wiki.sleuthkit.org/index.php?title=TSK_Tool_Overview but in short, tools are prefixed to tell you what level of abstraction they work on.)

What’s going on here? (See http://wiki.sleuthkit.org/index.php?title=Fls for more details.)

r and d refer to regular files and directories. v refers to virtual entries: frequently, certain artifacts are useful to investigators, and so TSK will give us virtual references to them in various contexts.

Files are stored somehow in a filesystem, and information about each file is stored in the filesystem’s metadata. In most filesystems, there are two parts to the metadata: the file name structure, and the rest of it (usually just called the metadata in TSK).

In the listing above, the first r/d refers to the type stored in the filename metadata; the second refers to the type stored in the general (last modified, location on disk, etc.) metadata. Usually they’re equal; if a file has been deleted and one of the two structures reallocated, they could differ.

The metadata structure about each file is given a unique address by most filesystems; if there is no such address in a filesystem, TSK will create (in a deterministic fashion) an address. That’s the number listed here. This is often referred to as the inode, which is the name of the structure in many Unix-derived filesystems.
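If you want to post-process fls output, each line can be split into the pieces just described. The sketch below assumes the simple line format shown in the listing above (type pair, optional asterisk, numeric address, colon, name); the pattern and the FlsEntry name are my own, and other filesystems may print addresses in forms this pattern does not handle. The optional asterisk is how fls flags a deleted entry.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "<name type>/<meta type> [*] <address>: <name>"
FLS_LINE = re.compile(
    r"^(?P<name_type>\S)/(?P<meta_type>\S)\s+"
    r"(?P<deleted>\*\s+)?"
    r"(?P<address>\d+):\s*"
    r"(?P<name>.*)$"
)


class FlsEntry(NamedTuple):
    name_type: str   # type recorded in the file name structure
    meta_type: str   # type recorded in the metadata structure
    deleted: bool    # True when fls printed an asterisk
    address: int     # metadata address (the "inode")
    name: str


def parse_fls_line(line: str) -> Optional[FlsEntry]:
    """Parse one line of fls output; return None if it does not match."""
    match = FLS_LINE.match(line.strip())
    if match is None:
        return None
    return FlsEntry(
        match["name_type"],
        match["meta_type"],
        match["deleted"] is not None,
        int(match["address"]),
        match["name"],
    )


for line in ("d/d 5:  images", "r/r 7:  Designs.doc", "v/v 163171: $MBR"):
    print(parse_fls_line(line))
```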

Files are typically stored in blocks on disk; each file may take up one or more blocks. In most file systems, each block is used by only one file, even if the file is smaller than the block size. In FAT, the volume label gets its own entry (this is part of how TSK abstracts the underlying disk / partition / filesystem structure; we’ll see more on this later).

Let’s say we want to extract the Designs.doc file from this disk image. We can use icat, which is similar to the cat utility, but instead of working from standard input to standard output, it will parse a disk image, look for the file associated with a given inode, and send it to standard output. We can capture this with the shell redirection operator >:

icat adams.dd 7 > Designs.doc

Let’s immediately take a fingerprint:

shasum -a 256 Designs.doc > Designs.doc.sha256

Notice we saved it to a file for later use. With the saved hash, other investigators can verify that it is possible to extract the identical data from the image. Additionally, the hash allows the investigation to be verified in a way that is independent of the tools we are using.
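One way to avoid any gap between extracting a file and fingerprinting it is to hash the bytes as they are written. The sketch below is only an illustration of that idea, not part of TSK; copy_and_hash and record are hypothetical helper names, and the recorded line imitates the "digest, two spaces, name" layout that shasum prints.

```python
import hashlib


def copy_and_hash(src, dst_path, chunk_size=1 << 16):
    """Copy a binary stream to dst_path and return its SHA-256 hex digest."""
    hasher = hashlib.sha256()
    with open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            hasher.update(chunk)   # fingerprint exactly the bytes we write
            dst.write(chunk)
    return hasher.hexdigest()


def record(digest, name, sidecar_path):
    """Save the digest next to the extracted file for later verification."""
    with open(sidecar_path, "w") as sidecar:
        sidecar.write(f"{digest}  {name}\n")


# Assumed usage with icat on the PATH (not executed here):
#   import subprocess
#   proc = subprocess.Popen(["icat", "adams.dd", "7"], stdout=subprocess.PIPE)
#   digest = copy_and_hash(proc.stdout, "Designs.doc")
#   proc.wait()
#   record(digest, "Designs.doc", "Designs.doc.sha256")
```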

Filesystems

Your computer is composed of many systems: file systems, databases, networking, and more. A system is software that presents a service to the user. File systems store and retrieve files for the user; database systems store data so that complex queries can be resolved; network systems transmit and route data to remote computers. Each system is designed to present a simple view of a complex set of interactions.

For example, the networking system hides a large amount of work that goes into presenting a Web page. Similarly, the file system presents your files to you without the details of how files are actually managed. The operations that are hidden from the user by each system are designed to operate efficiently, and often efficiency means leaving artifacts behind.

This is particularly true when files are deleted because the quickest mechanism for the file system is to simply mark the file as deleted, then to overwrite the file when space is needed later. You might think of this process as similar to what you do when writing with a pen. It is easier to cross out mistakes from sentences in a letter than it is to re-copy the entire letter’s text onto a new sheet of paper, but the old writing might still be legible.

The file system presents to the user a simplified logical view of the file system that doesn’t include all the information available to the file system itself. In fact, the goal of every system is to present a complex service as something simpler via an interface. One analogy to this process is dining at a restaurant. The menu and the waiter presented to the customer are an interface that can be used to order food from the kitchen.

The user’s view of the filesystem:

user view

The interface allows the user to create files, and storage space is allocated accordingly. The user can issue a command to delete the file, and storage space is then unallocated. File modifications can occur through overwriting the same storage space with new data, finding space for additional data, or deleting the existing file and then creating a new file with the same name. Users have no information or influence regarding where a file is stored on the storage medium – that is a complication handled by the file system. For forensic investigators, the life of data in storage is not so simple.

Here’s the internal life cycle of files, data, and storage on a filesystem, beginning with data given to the file system by some application, such as a word processor.

system view

In Step 1, a file is written to a specific allocated block x and we say that it contains active data. When the user or application issues the command to delete the data, the process moves to Step 2, where the storage block is unallocated. The data remains present in storage but we say it is expired; that is, the data cannot be recovered through the filesystem’s interface to the user but it is retrievable by an investigator with direct access to the disk. Finally, once the same block is allocated to new data, the overwritten portions of the old data are removed and unavailable to the investigator.
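The life cycle above can be mimicked with a toy model. This is a deliberately simplified sketch (one block per file, a made-up ToyDisk class) and not how any real filesystem is implemented; it only shows why deleted data remains readable from the raw storage until something overwrites it.

```python
class ToyDisk:
    """A tiny block store where deleting a file only unallocates its block."""

    def __init__(self, num_blocks, block_size=8):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.allocated = [False] * num_blocks
        self.files = {}  # name -> (block index, length)

    def write(self, name, data):
        if len(data) > self.block_size:
            raise ValueError("toy model: a file must fit in one block")
        index = self.allocated.index(False)  # first free block
        old = self.blocks[index]
        # Only the bytes that are actually written replace the old contents.
        self.blocks[index] = data + old[len(data):]
        self.allocated[index] = True
        self.files[name] = (index, len(data))
        return index

    def read(self, name):
        """The user's view: only files the filesystem still knows about."""
        index, length = self.files[name]
        return self.blocks[index][:length]

    def delete(self, name):
        index, _ = self.files.pop(name)
        self.allocated[index] = False  # the data itself is left in place

    def raw_block(self, index):
        """The investigator's view: whatever bytes are in the block."""
        return self.blocks[index]


disk = ToyDisk(num_blocks=4)
block = disk.write("design.doc", b"RABBIT01")   # Step 1: active data
disk.delete("design.doc")                       # Step 2: expired data
print(disk.raw_block(block))                    # old bytes are still there
disk.write("memo.txt", b"memo")                 # block reused by new data
print(disk.raw_block(block))                    # only part of the old data survives
```

In the last step the new file is shorter than the old one, so the tail of the old data is still sitting in the block.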

Throughout this class, we will see that all systems on a computer have an interface presented to the user or to another system; our job as forensic investigators is to learn about the exact life cycle of data as it is transmitted, stored, or computed by the system.

Let’s now take a closer look at Adams’ files. Recall that there is a subdirectory. To see a list of all files, recursively traversing the entire file system and prepending the full path to each file, issue the fls -r -p command (-r for recursive, -p for show full path).

fls -r -p adams.dd 
r/r 3:  ADAMS       (Volume Label Entry)
d/d 5:  images
r/r * 549:  images/_MG_3027.JPG
r/r 7:  Designs.doc
v/v 163171: $MBR
v/v 163172: $FAT1
v/v 163173: $FAT2
d/d 163174: $OrphanFiles

The command reveals another file, but this time, TSK puts an asterisk next to it, denoting that the file has been deleted. We can extract the file and fingerprint it using the icat and shasum commands, just as we did above. TSK can still find the file’s metadata entry at address 549, and with it where the file’s data began; however, because the file is deleted, information about the subsequent blocks in which the file was located could have been disturbed. Include the -r flag to let TSK make its best guess (use ‘r’ecovery techniques) at the complete file.

icat -r adams.dd 549 > image.jpg

We can now view the file.

Locard’s exchange principle

In this case, the alleged crime is property theft, a civil dispute, and the hypothesis was suggested to us by Acme: Adams copied proprietary documents and took them with her to the Nadir Toy Corp.

To build a case, we make use of a foundational principle of criminal investigations.

Locard’s Exchange Principle: Anyone and anything entering a scene can take something of the scene with them, and they can leave something of themselves behind.

When this principle is in effect, it produces two types of evidence:

  1. Class characteristics: traits common in similar items;
  2. Individual characteristics: unique traits that can be linked to a specific person or activity with greater certainty.

The classic example that distinguishes these two categories is shoe prints: the make and model of a shoe is a class characteristic, whereas the scuff marks identify a particular shoe. A file’s extension tells us that the file was probably created by Microsoft Word, a class characteristic; the metadata stored in the file can tell us information about who created it, an individual characteristic. To overcome the limited view presented by a partial record of events, investigators will seek to present the largest possible set of corroborating evidence that supports a hypothesis; Locard’s principle is instrumental in that goal.

Locard’s principle explains why evidence may be found, but it does not mandate that evidence is always present. It is not impossible to have a crime scene with no evidence at all: Data can be overwritten easily; fire may melt backup storage; firefighters may wash away fingerprints.

Application metadata

The Designs.doc file can be opened with OpenOffice or Microsoft Word. When we open the file, we see that its contents are the design of the toy that Adams released at Nadir Toy Corp. Now examine the document properties by selecting the menu option File→Properties.... The document has data that suggests that it was first created when Adams was at Acme Company.

This information is called application metadata. In the Designs.doc file, the metadata represents an individual characteristic that links the file’s creation and use to Acme. The file type — Microsoft Word document — is a class characteristic. As we will see throughout the class, most digital artifacts are leads that help us find stronger evidence or lead us to other witnesses. This metadata does not strongly implicate Adams. We do not know how such information was entered into the document. Perhaps she still uses a version of MS Word that was installed by Acme but then wrote this document after she left the company. The best way to use this information is to look for the same document or features of the document on the files that are contained on Adams’ former desktop computer at Acme and the backed-up archives of that machine.

EXIF data

Photos taken with digital cameras store a wealth of metadata with the image data. One format for storing extra information is the Exchangeable Image File Format (EXIF). You can reveal the EXIF data stored in an image, if any, using a number of tools. Most image viewer programs allow you to view the “properties” of an image. Several free tools are available to parse the EXIF data for you; one example is Phil Harvey’s exiftool, which parses much more than EXIF, it turns out.
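To see where EXIF data lives inside a JPEG, here is a minimal sketch that walks the file's segments looking for the APP1 segment that carries it. It is a simplification and not a replacement for exiftool: it does not decode any tags, it ignores padding bytes and other corner cases of the JPEG format, and find_exif_segment is a name I made up.

```python
import struct


def find_exif_segment(jpeg: bytes):
    """Return the raw EXIF payload (a TIFF structure) from a JPEG, or None."""
    if jpeg[:2] != b"\xff\xd8":            # every JPEG starts with SOI
        raise ValueError("not a JPEG (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            return None                    # lost track of segment boundaries
        marker = jpeg[pos + 1]
        if marker in (0xDA, 0xD9):         # start of scan data, or end of image
            return None
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack_from(">H", jpeg, pos + 2)
        payload = jpeg[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return payload[6:]             # skip the "Exif\0\0" identifier
        pos += 2 + length
    return None
```

With the recovered photo on disk, find_exif_segment(open("image.jpg", "rb").read()) returns the bytes a tool would then decode into fields such as the camera make, model, and timestamps, or None if no such segment is found.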

The EXIF information can record a time when the photo was taken and modified, a GPS position, and the make and model of the camera. Note again that such information is not sufficient or reliable for making a strong case. First, we do not know if the data is real — it may have been edited by a third party. Or, the data may be real, but the timestamp may not be accurate because the clock referenced may have been set incorrectly. Similarly, we may not know if a GPS location of a photo notes when the picture was taken or when the camera was last powered on outside (GPS signals do not reach the inside of buildings).

What this information does tell us is that it may be a good idea to ask Adams if she has a Canon PowerShot S500. If she does, and it has serial number 4593234, then that is suggestive that her camera took this picture. At the same time, it does not prove Adams operated the camera when the picture was taken. If the camera has a flash card inserted in it when it is located, then we might find this same file on the flash card, which is stronger proof that a camera she owns took the photo. We might also verify that the clock is set correctly or learn details about how the camera handles GPS signals when inside buildings. Finally, a conversation with Adams about the camera will reveal more information: e.g., does she ever lend the camera to anyone? Where is the camera kept and who else has access to it?

In this case, let’s assume that a camera with a matching serial number was found to be the property of Acme Toy Corp. and was in the possession of Adams (among others) while she was employed there.

Summing up

Can you form a hypothesis based on the evidence we recovered by applying Locard’s principle and other logical inferences? Let’s enumerate what we found out in our hypothetical investigation.

  1. Through a subpoena, we recovered Adams’ personal storage device. Let’s assume that Adams confirmed that the device is hers in an interview.
  2. Through an investigation, we discovered Acme’s document on an image of the device. The document contained metadata listing Acme’s name in the document properties. Locard’s principle and these individual characteristics suggest that Adams took this identifying information from Acme; however, we can’t be sure when she created the document: before or after she left the company.
  3. We recovered a deleted image from the storage device. Metadata in the EXIF portion of the image suggests the photo was taken with a camera with serial number 4593234 owned by Acme.

Take a moment to formulate and write down a hypothesis that explains the events that took place in the case involving Adams. Justify your decisions.

End-of-class reminders

Homework.

Signups for Piazza, Gradescope, etc., if you added after the first class.