Filesystems: practical demonstration
I got a few questions about filesystem stuff after class, and I anticipate a few more once the next homework goes up later today or early tomorrow. So we’re going to start with a few examples and a walkthrough of how FAT works when you add and remove files.
The plan here is to show you what happens on disk when files are created and removed, viewing both the raw bytes in a hex editor and through the slightly more refined interface of The Sleuth Kit.
I’ll be using a Linux VM on my Mac to do this – you could install your own (I use a combination of VirtualBox and Vagrant here, but any virtualization system would work) if you wanted to reproduce some of these steps.
I’m not going to do it for NTFS. FAT is fairly straightforward to show – partly because of its lineage from a time when simplicity was a virtue. NTFS has many more small details; they’re manageable, but not in a quick class demo.
Drives are files in Unix, so why not have a file be a drive? Let’s make an empty one:
dd if=/dev/zero of=fat.dd bs=1M count=10
Now let’s turn it into a FAT filesystem:
mkfs.fat fat.dd
And let’s see what fsstat has to say.
How about hexdump?
dd if=fat.dd bs=512 count=1|hexdump -Cv
Notice how we can see the values are correct (per Carrier).
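To make “the values are correct” concrete, here’s a rough Python sketch that decodes a few boot-sector fields at the offsets Carrier documents. The packed-in values are my assumptions for an image like this one (512-byte sectors, 4 sectors per cluster, 1 reserved sector, 2 FATs of 20 sectors each), not a dump of the actual disk:

```python
import struct

def parse_fat_boot_sector(sector: bytes) -> dict:
    """Decode a few key FAT12/16 boot sector fields (offsets per Carrier)."""
    return {
        "bytes_per_sector":    struct.unpack_from("<H", sector, 11)[0],
        "sectors_per_cluster": sector[13],
        "reserved_sectors":    struct.unpack_from("<H", sector, 14)[0],
        "num_fats":            sector[16],
        "max_root_entries":    struct.unpack_from("<H", sector, 17)[0],
        "sectors_per_fat":     struct.unpack_from("<H", sector, 22)[0],
    }

# Synthetic boot sector with assumed values for a small FAT16 image.
sector = bytearray(512)
struct.pack_into("<H", sector, 11, 512)  # bytes per sector
sector[13] = 4                           # sectors per cluster (2 KB clusters)
struct.pack_into("<H", sector, 14, 1)    # reserved sectors before FAT 0
sector[16] = 2                           # two copies of the FAT
struct.pack_into("<H", sector, 17, 512)  # root directory entries
struct.pack_into("<H", sector, 22, 20)   # sectors per FAT
info = parse_fat_boot_sector(bytes(sector))
```

With these assumed values the root directory starts at sector 1 + 2×20 = 41, which is why the later hexdump uses skip=41.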
And what’s on the disk?
fls says nothing (just virtual entries). And if we look at the FAT sectors we see that’s mostly true – note that the entries for cluster 0 and 1 are not actually for clusters on disk (their numbering starts at 2) – these are instead status flags for the drive.
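A quick sketch of what those first two reserved FAT entries look like, using synthetic bytes for a freshly formatted FAT16 table (entry 0 carries the media descriptor – 0xF8 means fixed disk – and entry 1 is reserved; real clusters start at entry 2):

```python
import struct

def fat16_entries(fat: bytes, n: int):
    """Return the first n 16-bit FAT entries (little-endian)."""
    return [struct.unpack_from("<H", fat, 2 * i)[0] for i in range(n)]

# Synthetic empty FAT16 table: entries 0 and 1 are status/reserved values,
# everything from entry 2 on is 0x0000 (free cluster).
fat = bytearray(512)
struct.pack_into("<H", fat, 0, 0xFFF8)  # entry 0: media descriptor + fill
struct.pack_into("<H", fat, 2, 0xFFFF)  # entry 1: reserved
entries = fat16_entries(bytes(fat), 4)
```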
Now let’s put a file into the disk. We have to mount it first:
sudo mount fat.dd /mnt/fat/
and check for it:
mount | grep fat
/vagrant/590f/fat.dd on /mnt/fat type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
Now let’s put a one-cluster file onto the disk, composed of random junk:
sudo dd if=/dev/urandom of=/mnt/fat/ONE.DAT bs=512 count=4
We can see it’s on the disk now using fls, fsstat, and istat. We can also see the new entries in the hexes of the appropriate areas:
dd if=fat.dd bs=512 count=1 skip=41|hexdump -Cv
Compare w/ Carrier to see the entries are what we expect. Also look at the FAT:
dd if=fat.dd bs=512 count=1 skip=1|hexdump -Cv
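If you want to decode one of those 32-byte directory entries yourself rather than eyeballing the hexdump, here’s a minimal sketch. The sample bytes are synthetic (a plausible entry for ONE.DAT: 2048 bytes, starting at cluster 2), not copied from the real image:

```python
import struct

def parse_dirent(e: bytes) -> dict:
    """Decode the fields of one 32-byte FAT directory entry we care about."""
    return {
        "name": e[0:8].decode("ascii").rstrip() + "." + e[8:11].decode("ascii").rstrip(),
        "attrs": e[11],
        "start_cluster": struct.unpack_from("<H", e, 26)[0],
        "size": struct.unpack_from("<L", e, 28)[0],
    }

# Synthetic entry: 8.3 name, archive bit, first cluster 2, size 2048 bytes.
e = bytearray(32)
e[0:11] = b"ONE     DAT"
e[11] = 0x20                          # 'archive' attribute bit
struct.pack_into("<H", e, 26, 2)      # first cluster of the file
struct.pack_into("<L", e, 28, 2048)   # file size in bytes (4 x 512)
info = parse_dirent(bytes(e))
```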
Now let’s make a two-cluster file:
sudo dd if=/dev/urandom of=/mnt/fat/TWO.DAT bs=512 count=8
Now we can see it in fls/fsstat/istat; we can also see the “cluster chain” in the FAT.
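Following a cluster chain is just repeated table lookup: the FAT entry for cluster N holds the number of the next cluster, until an end-of-chain marker (0xFFF8 or above for FAT16). A sketch with synthetic bytes – the cluster numbers are made up for illustration:

```python
import struct

def cluster_chain(fat: bytes, start: int):
    """Follow a FAT16 cluster chain from `start` to the end-of-chain marker."""
    chain, cur = [], start
    while cur and cur < 0xFFF8:      # 0 = free, >= 0xFFF8 = end of chain
        chain.append(cur)
        cur = struct.unpack_from("<H", fat, 2 * cur)[0]
    return chain

# Synthetic FAT: a two-cluster file in clusters 6 and 7
# (entry 6 points to 7, entry 7 holds the end-of-chain marker).
fat = bytearray(512)
struct.pack_into("<H", fat, 2 * 6, 7)
struct.pack_into("<H", fat, 2 * 7, 0xFFFF)
chain = cluster_chain(bytes(fat), 6)
```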
Now let’s remove ONE.DAT:
sudo rm /mnt/fat/ONE.DAT
What happens to the directory entry? A one-byte change! And the data is still on disk.
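That one-byte change is the deletion marker: the first byte of the directory entry is overwritten with 0xE5, and everything else (including the file’s clusters) is left alone. A tiny illustration on synthetic entry bytes:

```python
# Deleting a FAT file replaces the first byte of its 32-byte directory
# entry with 0xE5; the rest of the entry and the data clusters remain.
entry = bytearray(32)
entry[0:11] = b"ONE     DAT"
deleted = bytearray(entry)
deleted[0] = 0xE5
changed = sum(1 for a, b in zip(entry, deleted) if a != b)
```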
Now let’s add a three-cluster file; we can see that the FAT driver in Linux chose not to fragment, and instead put the file into a place where it would fit contiguously (right after TWO). We can also see it overwrote the directory entry for ONE, which is now no longer visible (or trivially recoverable), though the data actually remains in its cluster (as THREE got put somewhere else).
A point to take away is that if you want to do file recovery, you can run tools to do it. And, if you want to see how a filesystem’s implementation actually works, you can do so – it’s not a mystery. Go read Carrier (or wikipedia) to see the structures on disk, then you can actually do this yourself to see what’s on disk and what’s not.
Where it gets harder is when documentation is hard to come by, or when the on-disk data structures are harder to understand (due to optimizations, and/or just plain-old complexity).
Next we’re going to do an overview of the Linux ExtN filesystem(s) family.
Ext2 was an update / revision of UFS, the old (reliable) Unix File System. Ext3 added journaling to Ext2 and changed a few of its behaviors in ways that weren’t super user-visible – they improved reliability (through use of a journal), and changed some forensic details (some structures are now fully zeroed rather than left on disk, but OTOH the journal is also on disk). Ext4 added support for larger files, and introduced extents for file data, which are like blocks, but guaranteed to be contiguous on disk (to improve performance on large spinning metal disks). Ironically perhaps, this is not really necessary any more on SSDs.
Like FAT/NTFS, Ext aggregates sectors on disk. Ext calls them “blocks”.
So the basic data model for data storage in Ext2 is that you have directory entries: directory entries contain filenames and a pointer to an “inode”. Directories themselves are just files whose contents are directory entries. The inode contains all the file metadata, as well as pointers to the blocks that contain the file’s data, so-called “content blocks.” Inodes are numbered starting from 1; the first ten are reserved for various purposes. For example, inode 2 always points to the root directory of the filesystem; for ext3, inode 8 typically points to the journal (but its location can be specified in the superblock).
The partition is divided into “block groups”, each of which is essentially identical. A “superblock” located 1K from the start of the FS and 1K in size contains configuration values for the FS, including the block size, total # blocks, blocks per group, inodes per group, and so on.
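Here’s a rough sketch of reading a few of those superblock fields, at the offsets Carrier gives for ext2 (the magic number 0xEF53 lives at offset 56). The packed-in values are assumptions for a small filesystem with 1 KB blocks, not real superblock bytes:

```python
import struct

def parse_ext2_superblock(sb: bytes) -> dict:
    """Decode a few ext2 superblock fields (offsets per Carrier)."""
    return {
        "inode_count":      struct.unpack_from("<L", sb, 0)[0],
        "block_count":      struct.unpack_from("<L", sb, 4)[0],
        "block_size":       1024 << struct.unpack_from("<L", sb, 24)[0],
        "blocks_per_group": struct.unpack_from("<L", sb, 32)[0],
        "inodes_per_group": struct.unpack_from("<L", sb, 40)[0],
        "magic":            struct.unpack_from("<H", sb, 56)[0],
    }

# Synthetic superblock: 10240 blocks of 1 KB, 8192 blocks per group.
sb = bytearray(1024)
struct.pack_into("<L", sb, 0, 2560)     # total inodes
struct.pack_into("<L", sb, 4, 10240)    # total blocks
struct.pack_into("<L", sb, 24, 0)       # log2(block size / 1024): 1 KB blocks
struct.pack_into("<L", sb, 32, 8192)    # blocks per group
struct.pack_into("<L", sb, 40, 2560)    # inodes per group
struct.pack_into("<H", sb, 56, 0xEF53)  # ext2/3/4 magic number
info = parse_ext2_superblock(bytes(sb))
```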
Each block group looks about the same. They (optionally) start with a backup superblock, then contain a group descriptor table. The group descriptor table describes the layout of every block group on the filesystem; it tells you the block address of the block bitmap, inode bitmap, inode table, as well as the number of unallocated blocks and inodes in the group, and the number of directories in the group.
Then there’s a block bitmap, managing the allocation status of blocks in the group. The number of blocks per group is exactly the number of bits in one block (8 × the block size in bytes), so the bitmap always fits in a single block.
Then there’s an inode bitmap, managing the allocation status of inodes in this group. This is followed by the inode table itself (each inode is 128 bytes) and blocks for file contents.
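The “blocks per group = bits in a block” rule pins down the group geometry, and it’s worth doing the arithmetic once:

```python
def group_geometry(block_size: int) -> dict:
    """Blocks per group = bits in one bitmap block = 8 * block_size bytes."""
    blocks_per_group = 8 * block_size
    return {
        "blocks_per_group": blocks_per_group,
        "group_bytes": blocks_per_group * block_size,
    }

small = group_geometry(1024)  # 1 KB blocks: 8192 blocks, 8 MB per group
big = group_geometry(4096)    # 4 KB blocks: 32768 blocks, 128 MB per group
```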
Ext2 generally allocates files within a particular block group, using a first-available strategy. The idea is to minimize drive head movement; but this can change from version to version of Linux.
Inodes contain basically all file metadata except filename – things like ownership, modes, times, size, and so on.
Inodes have space for 12 block pointers (4 bytes each). If the file fits in 12 blocks, then great, these pointers point to those blocks. If not, then the rest of the blocks are stored indirectly – there’s a pointer to an “indirect” block, which contains a list of block pointers. If that’s not enough space, then there’s a “double indirect” block, which contains a list of pointers to a list of indirect blocks. And yes, there are triple indirect blocks (though that’s as far as it goes).
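The indirect scheme makes the maximum addressable file size a simple sum – 12 direct pointers, plus one block of pointers, plus a block of blocks of pointers, plus one more level:

```python
def ext2_max_blocks(block_size: int) -> int:
    """Total data blocks addressable by one inode: 12 direct pointers,
    then single, double, and triple indirect blocks of 4-byte pointers."""
    ptrs = block_size // 4          # pointers that fit in one block
    return 12 + ptrs + ptrs**2 + ptrs**3

# With 1 KB blocks each indirect block holds 256 pointers, so:
# 12 + 256 + 256**2 + 256**3 addressable blocks.
max_blocks_1k = ext2_max_blocks(1024)
```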
Inodes for files are usually allocated in the same block group as their parent directory (again, to minimize drive head movement). Inodes for new directories are typically placed elsewhere algorithmically (depending upon fs version, maybe balanced arithmetically, or by hash function) to spread load.
Directory entries contain a name and inode. By default, they also contain a pointer to the next entry; unused entries (like, if they are deleted) are skipped by changing the pointer of the previous entry. This makes finding deleted (but not yet overwritten) metadata straightforward. Alternatively, they can be organized in other ways, for example, a tree-like structure similar to NTFS, but stored in a backward-compatible way with the list structure (see Carrier for details) – it’s a B+tree stored linearly, where the hash of the filenames rather than the filenames themselves are the keys of the nodes.
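The list-walk (and why deleted entries stick around) is easy to demonstrate. This sketch builds a synthetic ext2 directory block – each entry is inode, record length, name length, type, then the name – and then “deletes” the middle entry the way the filesystem does: by stretching the previous entry’s record length over it. The inode numbers and names are made up:

```python
import struct

def make_dirent(inode: int, rec_len: int, name: str) -> bytes:
    """Build one ext2 directory entry: inode, rec_len, name_len, type, name."""
    e = struct.pack("<LHBB", inode, rec_len, len(name), 1) + name.encode()
    return e.ljust(rec_len, b"\x00")

def live_names(block: bytes):
    """Walk a directory block following rec_len; deleted entries are skipped
    because the previous entry's rec_len was enlarged to cover them."""
    names, off = [], 0
    while off < len(block):
        inode, rec_len, name_len = struct.unpack_from("<LHB", block, off)
        if inode != 0:
            names.append(block[off + 8:off + 8 + name_len].decode())
        off += rec_len
    return names

block = bytearray(make_dirent(11, 12, "a") +
                  make_dirent(12, 12, "b") +
                  make_dirent(13, 12, "c"))
# "Delete" b: stretch a's rec_len over b's slot; b's bytes stay on disk.
struct.pack_into("<H", block, 4, 24)
names = live_names(bytes(block))
```

A forensic tool that ignores rec_len and scans every 32-byte-aligned slot would still find the “b” entry intact.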
Journals are used to make a FS more robust after a crash or unexpected reboot. They can also improve performance in some cases.
In short, they first commit user data to a journal, and then “atomically” (for some value of atomically) update the actual file system based upon the data in the journal. Once the updates are written to the actual FS, the journal entry is marked as committed. If on boot there are uncommitted journal entries, they are first replayed onto the FS.
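A toy model of that replay logic, with the disk as a dict of block contents (all names and values here are invented for illustration; the real journal formats are far richer):

```python
# Each "transaction" is a set of queued block writes plus a commit flag.
# On mount, transactions with a commit record are replayed onto the disk;
# transactions that never got their commit record are simply dropped.

def replay(disk: dict, journal: list) -> dict:
    for txn in journal:
        if txn["committed"]:
            for block_no, data in txn["writes"]:
                disk[block_no] = data
    return disk

disk = {1: "old-inode", 2: "old-data"}
journal = [
    {"writes": [(1, "new-inode")], "committed": True},   # fully journaled
    {"writes": [(2, "new-data")], "committed": False},   # crashed mid-write
]
disk = replay(disk, journal)
```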
Ext3 does journaling at the block level – if even a single bit of a block (or inode, or whatever) changes, then the entire block is committed to the journal then to disk. A “descriptor sequence” block is written to the journal, followed by a sequence of queued block modifications (repeat as needed). When a journal transaction is then committed to the FS, the commit is marked as complete by writing a “commit sequence” block to the journal. That space is now available to be overwritten by later commits. The journal structure is a rolling log, so recent commits will still be there; the journal wraps around to overwrite the oldest blocks only after it fills its allocated space.
A couple notes: First, by default, only filesystem metadata (inodes, indirect blocks, etc. – everything except data blocks) are journaled. Data blocks are only journaled if that option is set when the filesystem is mounted. Second, the journal is overwritten from the start when the filesystem is mounted (and it’s not large), so it’s only a small window into the past (that said, it’s a pretty accurate one). Third, the OS can “revoke” journaled changes rather than committing them to disk (if appropriate, for example, if a file is created and then deleted before the journal is flushed). Finally, the way a journal helps provide filesystem robustness is that when the filesystem is mounted, it is checked for described but not committed blocks; these transactions can then be replayed if it makes sense to do so.