10: FS Demo, and Ext[n]

Filesystems: practical demonstration

I got a few questions about filesystem stuff after class and on Piazza. So we’re going to start with a few examples and a walkthrough of how FAT works when you add and remove files.

The plan here is to show you what happens on disk when files are created and removed, viewing both the raw bytes in a hex editor and through the slightly more refined interface of The Sleuth Kit.

I’ll be using a Linux VM on my Mac to do this – you could install your own (I use a combination of VirtualBox and Vagrant here, but any virtualization system would work) if you wanted to reproduce some of these steps.

I’m not going to do it for NTFS. FAT is fairly straightforward to show – partly because of its lineage from a time when simplicity was a virtue. NTFS has many more small details; they are handle-able, but not during a quick class demo.

Drives are files in Unix, so why not have a file be a drive? Let’s make an empty one:

dd if=/dev/zero of=fat.dd bs=1M count=10

Now let’s turn it into a FAT filesystem:

mkfs.fat fat.dd 

And let’s see what file and fsstat have to say.
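
Concretely, that’s something like:

file fat.dd       # should report a DOS/MBR boot sector with FAT details
fsstat fat.dd     # TSK's summary: FAT type, layout, sector and cluster sizes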

How about hexdump? dd if=fat.dd bs=512 count=1|hexdump -Cv

Notice how the values match what Carrier describes – for example, bytes per sector is the little-endian 16-bit value at offsets 11–12 (00 02, i.e. 512), sectors per cluster is the byte at offset 13, and the 55 aa signature sits at offsets 510–511.

And what’s on the disk? fls says nothing is there (just TSK’s virtual entries). And if we look at the FAT sectors we see that’s mostly true – note that the entries for clusters 0 and 1 are not actually for clusters on disk (cluster numbering starts at 2); they hold the media type and status flags for the volume instead.
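
For instance (the exact addresses will vary):

fls fat.dd
# expect only TSK's virtual entries ($MBR, $FAT1, $FAT2, $OrphanFiles) -- no real files yet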

Now let’s put a file into the disk. We have to mount it first:

sudo mount fat.dd /mnt/fat/
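
If /mnt/fat doesn’t exist yet, create the mount point first or mount will complain:

sudo mkdir -p /mnt/fat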

and check that the mount worked:

 mount |grep fat
/vagrant/590f/fat.dd on /mnt/fat type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)

Now let’s put a one-cluster file onto the disk, composed of random junk:

sudo dd if=/dev/urandom of=/mnt/fat/ONE.DAT bs=512 count=4
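
One caveat while the image is mounted: the kernel may cache writes to it, so if a later hex dump doesn’t show what you expect, flush things first:

sync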

We can see it’s on the disk now using fls, fsstat, and istat (sketched below). We can also see the new entries in hex dumps of the appropriate areas:

dd if=fat.dd bs=512 count=1 skip=41|hexdump -Cv    # sector 41 = start of the root directory on this image

Compare w/ Carrier to see the entries are what we expect. Also look at the FAT:

dd if=fat.dd bs=512 count=1 skip=1|hexdump -Cv    # sector 1 = start of the first FAT
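
The Sleuth Kit shows the same thing with less squinting. A sketch – the number passed to istat is whatever metadata address fls prints for ONE.DAT, not necessarily 4:

fls fat.dd        # ONE.DAT should now show up alongside the virtual entries
istat fat.dd 4    # details for that entry: size, timestamps, and its sectors
fsstat fat.dd     # filesystem-wide view, including the FAT contents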

Now let’s make a two-cluster file:

sudo dd if=/dev/urandom of=/mnt/fat/TWO.DAT bs=512 count=8

Now we can see it in fls/fsstat/istat; we can also see the “cluster chain” in the FAT.
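
To spell that chain out in the hex (at this size mkfs.fat will have picked FAT16, so FAT entries are 2 bytes each, little-endian):

dd if=fat.dd bs=512 count=1 skip=1|hexdump -Cv
# a two-cluster file shows up as a pair of entries: the first holds the number of
# the next cluster in the chain, the second holds an end-of-chain marker (ff ff or similar)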

Now let’s remove ONE.
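
Something like (with a sync so the change actually hits the image):

sudo rm /mnt/fat/ONE.DAT
sync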

What happens to the directory entry? A one-byte change: the first byte of the name is overwritten with 0xE5, marking the entry as unallocated. And the data is still on disk.
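
Next, a three-cluster file. I’ll call it THREE.DAT to match the others; with the same 4-sector (2 KB) clusters, that’s something like:

sudo dd if=/dev/urandom of=/mnt/fat/THREE.DAT bs=512 count=12
sync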

We can see that the FAT driver in Linux chose not to fragment it: instead it put the file where it fit contiguously (right after TWO). We can also see that the new directory entry overwrote the deleted entry for ONE, which is now no longer visible (or trivially recoverable) – though ONE’s data actually remains in its old cluster, since THREE’s data got put somewhere else.

A point to take away is that if you want to do file recovery, you can run tools to do it. And if you want to see how a filesystem’s implementation actually works, you can do so – it’s not a mystery. Go read Carrier (or Wikipedia) to see the structures on disk, and then you can do this yourself to see what’s on disk and what’s not.

Where it gets harder is when documentation is hard to come by, or when the on-disk data structures are harder to understand (due to optimizations, and/or just plain-old complexity).

Ext2/3/4

Next we’re going to do an overview of the Linux ExtN family of filesystems.

Ext2 was an update/revision of UFS, the old (reliable) Unix File System. Ext3 added journaling to Ext2 and changed a few of its behaviors in ways that aren’t very user-visible: reliability improved (through use of the journal), and some forensic details changed (some structures are now fully zeroed rather than left on disk, but OTOH the journal itself is also on disk). Ext4 added support for larger files and introduced extents for file data: instead of tracking every block with its own pointer, an extent records a contiguous run of blocks as a start and a length, which improves performance on large spinning-metal disks. Ironically perhaps, that matters much less on SSDs.

Like FAT/NTFS, Ext aggregates sectors on disk. Ext calls them “blocks”.

So the basic data model for data storage in Ext2: directory entries contain filenames, each paired with the number of an “inode”; directories themselves are just files whose contents are these entries. The inode contains all the file metadata, as well as pointers to the blocks that contain the file’s data, the so-called “content blocks.”

The partition is divided into “block groups”, each of which is essentially identical. A “superblock” located 1K from the start of the FS and 1K in size contains configuration values for the FS, including the block size, total # blocks, blocks per group, inodes per group, and so on.
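
You can poke at these values yourself with the same image trick we used for FAT (a sketch; ext2.dd and the size are just what I’m using here):

dd if=/dev/zero of=ext2.dd bs=1M count=10
mkfs.ext2 ext2.dd
dumpe2fs ext2.dd | head -40    # block size, block and inode counts, blocks per group, ...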

Each block group looks about the same. They (optionally) start with a backup superblock, then contain a group descriptor table. The group descriptor table describes the layout of every block group on the filesystem.

Then there’s a block bitmap, managing the allocation status of the group’s blocks. The number of blocks per group is exactly the number of bits in one block – so with 1 KB blocks that’s 8 × 1024 = 8192 blocks per group, and with 4 KB blocks it’s 32768.

Then there’s an inode bitmap, managing the allocation status of the inodes in this group. This is followed by the inode table itself (each inode is 128 bytes) and then the blocks for file contents.
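
TSK’s fsstat lays all of this out per group (a sketch, run against the ext2.dd image from above):

fsstat ext2.dd
# look for the block group section: each group lists the ranges holding its block
# bitmap, inode bitmap, inode table, and data blocks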

inodes

Ext2 generally allocates a file’s blocks within a particular block group, using a first-available strategy. The idea is to minimize drive head movement, but the details can change from version to version of Linux.

Inodes have space for 15 block pointers (4 bytes each). The first 12 are direct pointers: if the file fits in 12 blocks, then great, they point straight at the data blocks. If not, the remaining three are a single-, a double-, and a triple-indirect pointer; each points at a block that is just a list of further pointers, nested one, two, or three levels deep. (With 1 KB blocks, an indirect block holds 256 pointers, so the single-indirect pointer covers another 256 blocks, the double-indirect 256², and so on.)
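
You can look at an inode’s pointers directly with debugfs (a sketch; <2> is the root directory’s inode, and the angle brackets tell debugfs it’s an inode number rather than a path):

debugfs -R 'stat <2>' ext2.dd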

Inodes for files are usually allocated in the same block group as their parent directory (again, to minimize drive head movement). Inodes for new directories are typically placed elsewhere, chosen algorithmically (e.g., via a hash function), to spread the load across the disk.

directory entries

Directory entries contain a name and an inode number. By default, they also contain what is effectively a pointer to the next entry; unused entries (deleted ones, say) are skipped by changing the previous entry’s pointer so it steps over them. This makes finding deleted (but not yet overwritten) metadata straightforward. Alternatively, directories can be organized in other ways – for example, a tree-like structure similar to NTFS’s, stored in a way that remains backward-compatible with the list structure (see Carrier for details).
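
The FAT-style experiment works here too: put a file on the ext2 image, delete it, and fls will still list the name, flagged as deleted. A sketch (the mount point and GONE.TXT are arbitrary names):

sudo mkdir -p /mnt/ext2
sudo mount ext2.dd /mnt/ext2
sudo touch /mnt/ext2/GONE.TXT
sudo rm /mnt/ext2/GONE.TXT
sudo umount /mnt/ext2
fls ext2.dd    # deleted entries are marked with a *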

Ext3 journal

Journals are used to make a FS more robust after a crash or unexpected reboot. They can also improve performance in some cases.

In short, pending changes are first written to the journal and marked there as committed; then the actual file system is updated, “atomically” (for some value of atomically), based upon the data in the journal. Once those updates have been written to the actual FS, the transaction’s space in the journal can be reclaimed. If on boot the journal contains committed transactions whose updates never made it to the FS, they are first replayed onto the FS (incomplete, uncommitted transactions are simply discarded).

Ext3 does journaling at the block level – if even a single bit of a block (an inode table block, a bitmap, whatever) changes, then the entire block is written to the journal and then to disk. For each transaction, a descriptor block (carrying a sequence number) is written to the journal, followed by the queued block updates, and finally a commit block (repeat as needed). Once a transaction’s updates have been written out to the actual FS, its space in the journal becomes available to be overwritten by later transactions. The journal is a circular, rolling log, so blocks from recent and not-so-recent commits will still be there until the log wraps around and overwrites them.
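
TSK also ships tools for looking at the journal itself. A sketch, assuming an ext3 image made the same way as before (mkfs.ext3 instead of mkfs.ext2):

jls ext3.dd        # list journal blocks: descriptors, data blocks, commit blocks
jcat ext3.dd 20    # dump one journal block (20 is arbitrary; pick one jls showed)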