COMPSCI 590K: Advanced Digital Forensics Systems | Spring 2020

20: Cloud Forensics

It’s 2020, so what discussion of any topic would be complete without asking how we can make it cloud?

Today we’re going to talk about cloud forensics.

What we are not talking much about

There are (at least) two ways to interpret “cloud forensics.”

The first is “performing forensic tasks in the cloud,” in other words, using “cloud” the way that it’s used in virtually all other contexts. You can “spin up” virtual machines to do tasks, and forensic tasks are included.

(A brief overview of the cloud: in short, it’s a fancy way of saying you’re renting remote servers. There are some (very important) differences that have to do with dynamic provisioning, efficient networking, virtualization, and so forth. These raise very hard research questions, both in CS and in more general IT / management, but they’re not super material here.)

Of course you need to consider bandwidth and costs and whatnot – at scale, it costs significant dollars to move data into and out of various cloud providers like Amazon’s EC2, Google Cloud, and/or Azure. You can work to minimize these costs in various ways. For example, a system that hashes blocks of disk images and only uploads unique blocks might save you significant bandwidth.
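To make the block-hashing idea concrete, here is a minimal sketch. It assumes fixed-size 4096-byte blocks, SHA-256 as the block hash, and a plain set of digests standing in for whatever index the remote store keeps; the names plan_upload and already_stored are made up for illustration and are not from any particular system.

```python
import hashlib
import io

BLOCK_SIZE = 4096  # assumed block size; a real system would tune this


def iter_blocks(stream, block_size=BLOCK_SIZE):
    """Yield fixed-size blocks from a binary stream (the last may be short)."""
    while True:
        block = stream.read(block_size)
        if not block:
            return
        yield block


def plan_upload(stream, already_stored, block_size=BLOCK_SIZE):
    """Work out which blocks of one disk image actually need uploading.

    Returns (manifest, to_upload):
      manifest  -- ordered list of block digests, enough to rebuild the image
      to_upload -- dict of digest -> block bytes that the store does not have
    """
    manifest = []
    to_upload = {}
    for block in iter_blocks(stream, block_size):
        digest = hashlib.sha256(block).hexdigest()
        manifest.append(digest)
        # Skip blocks the store already holds and repeats within this image.
        if digest not in already_stored and digest not in to_upload:
            to_upload[digest] = block
    return manifest, to_upload


if __name__ == "__main__":
    # Toy "image": three zero blocks, one distinct block, one more zero block.
    image = (b"\x00" * (3 * BLOCK_SIZE)
             + b"\x01" * BLOCK_SIZE
             + b"\x00" * BLOCK_SIZE)
    manifest, to_upload = plan_upload(io.BytesIO(image), already_stored=set())
    sent = sum(len(block) for block in to_upload.values())
    print(f"{len(manifest)} blocks in image, {len(to_upload)} to upload, "
          f"{sent} of {len(image)} bytes sent")
```

The manifest is what lets you rebuild the full image later from the stored blocks. Disk images tend to contain many repeated blocks (zeroed sectors, common OS files), which is where the savings would come from.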

But at its core, this is a scaling/engineering problem that’s not unique to forensics. Thus far it doesn’t appear that there’s any significant and unique challenge here that isn’t addressed more generally in distributed systems methods, so we’re not going to dwell on it.

You can read various recent papers / presentations about systems that do forensics in the cloud (SCARF and Turbinia, from DFRWS and OSDFCon) if you want to know more.

You can also look at various forensics vendors’ websites (for example, Elcomsoft), where they advertise products that are clearly versions of cloud-based GPU-enhanced password crackers (claims include: “5-200x faster than CPU”).

Forensics on cloud data

So then, the real question is asking what you can do with user data that’s “in the cloud.” Can you forensically analyze it (or analyze it at all)? Once again, there are several broad classes of data to consider.

Explicitly-created data on cloud machines

So, the potentially easy case first.

If Joe User is renting EC2 instances or the like, they are essentially controlling a remote server. So, with provider help, an investigator can do all of the usual forensic things: disk imaging, memory capture, and so on. In some ways, it’s worse for Joe (and better for the investigator), since virtualized machines (depending upon the virtualization stack) make it even easier to capture disk images, and make it trivial to capture memory, as compared to a standalone server. Rented bare metal servers (or rack space) can be more robust in this regard, but generally when you delegate physical security to a third party, you are accepting that legal acquisition might happen. I mean, it can happen to your home computer, too, and a computer (physical or virtual) secured by someone else means it’s kinda out of your hands.

So, to do forensics on the cloud in this sense, there’s no special magic: an investigator with a warrant goes to Amazon or whatnot and requests the data they want; Amazon typically assists in providing it, and this is considered the start of the chain of custody. In particularly sensitive cases an LE agent might supervise the acquisition directly, but obviously Amazon is best suited to do it. (Same for all cloud providers.)
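One routine piece of starting a chain of custody is recording a cryptographic hash of whatever was handed over, so that later copies can be checked against it. The sketch below is only an illustration of that bookkeeping: the record fields and the function names (custody_record, still_matches) are hypothetical, not a standard format or any provider’s actual procedure.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path, received_from, received_by):
    """Build a (hypothetical) custody entry for one acquired file."""
    return {
        "file": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": sha256_file(path),
        "received_from": received_from,
        "received_by": received_by,
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def still_matches(path, record):
    """True if the file's current hash equals the hash in the record."""
    return sha256_file(path) == record["sha256"]


if __name__ == "__main__":
    # Stand-in for a disk image handed over by the provider.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".img") as tmp:
        tmp.write(b"pretend this is a disk image")
    record = custody_record(tmp.name, "cloud provider", "examiner")
    print(json.dumps(record, indent=2))
    print("unchanged:", still_matches(tmp.name, record))
    os.remove(tmp.name)
```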

We haven’t talked much about memory forensics yet (we will soon), but because acquiring memory dumps from virtual machines is so easy, many common technologies that are effectively “anti-forensics” are much more easily foiled. Full-disk encryption is a good example. This is not surprising if you think about it: FDE is designed to protect data at rest (that is, as it is stored on a computer). But of course the running OS has to have access to the key, otherwise it would not be able to read/write data to disk. So the key sits in memory while the machine is running, and a memory dump of the running VM will contain it.

Explicitly-created data on cloud services

What about something not quite as down-and-dirty? Say that, rather than SSH-ing into your instances, you instead use a service that stores data in the cloud, like Dropbox.

Here things get a little fuzzier. It is possible to subpoena Dropbox (or perform a warranted search) but generally their distributed filesystems are not amenable to being “seized” in the ordinary sense. You can’t take a whole data center (practicality aside, courts wouldn’t view this as reasonable, as you would be seizing data belonging to many unrelated parties).

So you can do some version of the above, where you ask Dropbox for the data and they provide it.

You might also directly connect to Dropbox. They have an API you can use to request file data and metadata, and there are programming interfaces available for it (see Cloudian, for example). From this you can programmatically pull account metadata, and Dropbox-level filesystem data and metadata. You can do the same for Google Docs and the like.
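Here is a minimal sketch of what pulling Dropbox-level metadata could look like. It assumes Dropbox’s HTTP API v2 endpoints /2/files/list_folder and /2/files/list_folder/continue, and an OAuth access token for the account that you already hold legitimately; the DROPBOX_TOKEN environment variable and the helper names are made up for this example, and you should check the current API documentation before relying on any field names.

```python
import json
import os
import urllib.request

API = "https://api.dropboxapi.com/2"  # assumed base URL for API v2


def http_post(url, token, payload):
    """POST a JSON payload with a bearer token; return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_metadata(token, path="", post=http_post):
    """Yield the metadata entry for every file and folder under `path`.

    An empty path is assumed to mean the account's root folder. The `post`
    argument exists so the network call can be swapped out, e.g. for testing.
    """
    reply = post(f"{API}/files/list_folder", token,
                 {"path": path, "recursive": True})
    while True:
        yield from reply["entries"]
        if not reply["has_more"]:
            return
        # Large listings are paged; follow the cursor until exhausted.
        reply = post(f"{API}/files/list_folder/continue", token,
                     {"cursor": reply["cursor"]})


if __name__ == "__main__":
    token = os.environ.get("DROPBOX_TOKEN")  # hypothetical variable name
    if token:
        for entry in list_metadata(token):
            print(entry.get(".tag"), entry.get("path_display"),
                  entry.get("server_modified"))
```

Note that this only lists metadata; it does not download file contents, and it sees only what the API chooses to expose, not the provider’s underlying storage.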

In these cases, there are also artifacts left on local drives. Obviously for file mirroring systems the files themselves might be on the local drive, but Dropbox and programs like it also leave extensive artifacts (about sync state and the like). Even Google Docs leaves things locally, though the formats are odd (artifacts appear to be a list of deltas, though this is all reverse-engineered; see Roussev and others).

Implicitly-created data

This is the most interesting case!

A shockingly large amount of your data is, with very little configuration from you, being automatically copied to the cloud and synced across your devices. This is especially true if you / your devices sign into or otherwise participate in the Apple iCloud ecosystem (which is heavily encouraged) or Google’s ecosystem (likewise).

Let’s look at Apple. Vladimir Katalov gave a great presentation at DFRWS in 2018 that I’m going to cite liberally for the rest of today. (see:

His talk was motivated by the difficulty of doing smartphone forensics. It’s increasingly difficult to get data off of phones (especially without owner cooperation), but the fact that phones send it all to a third party means you might be able to get (some of) it that way.

In short, lots of stuff goes to the cloud: call logs, texts, emails, chats, wi-fi names and passwords, web history and passwords, documents, settings, pictures, videos, location history, routes, etc. Some of this stuff is synced more or less in real time!

Occasional backups of your devices go to the cloud too, though vendors are often careful to remove or encrypt sensitive information.

Cloud acquisition helps bypass various protections on the physical device: there’s no need for the device passcode, it circumvents whole-disk encryption (though data in the cloud might itself be encrypted), and it potentially gives access to the system-wide keychain. It’s also no problem if the device is broken or lost.

Apple takes great care to make data not acquirable without owner consent (remember the San Bernardino case?). But there are options. For example, if you have access to another of the user’s devices, you may be able to get an auth token off of it. Or, if the user set up a recovery key, that will grant you access.

Related story for Google: Google Cloud lets you have a single point of entry for the highly-fragmented ecosystem, and presents a single interface.

Same deal: if you can get access to a device that was signed in, you can access (most of) the rest of their account. (See last slide of Katalov’s talk for various ways to get access to accounts.)