11: ZFS

Review: The old way of volume and filesystem management

Since time immemorial (ha-ha), system administrators and even home users followed a particular model. One disk = one volume. The volume is partitioned into one or more file systems; each file system is formatted in a particular way, and mounted or made visible to the user. (examples: Linux Ext; Windows NTFS; USB FAT).

At some point, it became clear it would be helpful to have volumes that spanned multiple disks, for a variety of reasons.

One reason was capacity – sometimes you just needed more space than fit on a disk (or on an affordable disk, anyway). Another was redundancy – if a disk died, it might be good not to have to take your system completely offline and restore from backups.

These multi-disk volumes can be implemented in a variety of ways. One such way is using either hardware or software RAID. RAID abstracts multiple disks into a single logical disk presented to your OS, which then partitions and formats this single disk as usual. Under the hood, RAID (especially hot-swappable RAID) can do a variety of interesting things: striping, concatenating, mirroring, parity, and so on. Software implementations of RAID work similarly but don’t require specialized hardware; some things not explicitly called RAID, like Linux’s Logical Volume Manager, can do many of the same things (and some others besides, like snapshotting volumes if space permits).
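
To make the stacked model concrete, here is a rough sketch of one way it might look on a Linux box using software RAID plus LVM; the device names, sizes, and mount point are placeholders, not a recipe:

    # Build a software RAID-1 mirror out of two raw disks
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

    # Layer a volume manager on top of the RAID device...
    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate --name data --size 100G vg0

    # ...and finally a filesystem on top of the logical volume
    mkfs.ext4 /dev/vg0/data
    mkdir -p /mnt/data
    mount /dev/vg0/data /mnt/data

Note that each layer (md, LVM, ext4) is created and tuned separately, which is exactly the disconnect described next.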

One of the problems with this stacked model is that the volume management and the filesystem are relatively oblivious to one another; there are many ways things can be tuned, but not automatically (sysadmins have to twiddle stripe widths, extent sizes, and so on to keep the layers reasonably in tune).

In the mid-2000s there was renewed (industrial) interest in filesystem design, and we’ve seen several filesystems under development and entering deployment since then that attempt to unify volume management and the filesystem, with varying goals including better scalability and reliability: ZFS, btrfs, APFS, and so on.

ZFS

Today we’re going to talk about ZFS, which Sun/Oracle developed originally as a from-scratch successor to the venerable and much-extended UFS that underpinned Solaris.

ZFS is kinda nuts (in a good way) compared to the other FSs you may have seen in classes and that we’ve talked about in this class. It bundles together volume management, and redundancy, and reliability, and the usual job of filesystems, and so on, into one big system. So let’s talk about it at a high level, and then piece-by-piece so you can see the forensic implications.

The first thing to know is that ZFS (or its design, really) generally cares most about reliability; there are lots of ways it supports redundancy, checksumming, and so on. Next is that it does in fact care about speed, but more, I think, about throughput than latency. It does also care about latency, particularly if the user sets the system up appropriately (by putting the ZIL, basically the FS journal, onto high-speed media). Finally, it supports features that were very unusual and quite forward-thinking at the time: it’s endianness-aware, it supports export and streaming of filesystems, it supports snapshots of filesystems, and so on. More on this as we go.

High-level ZFS concepts

The highest-level concept in ZFS is that of the disk pool (sometimes called a zpool). A zpool consists of one or more disk groups (virtual devices, or vdevs); and a disk group consists of one or more disks.

Disk groups can just be single disks, or they can operate in various special modes. They can be configured to run in various RAID-like modes (striped, mirrored, striped+mirrored, or with 1-, 2-, or 3-disk parity). They can also be configured as hot spares, which the OS can automatically substitute in if a mirrored or parity-protected drive fails. Usually, all drives in the pool are configured in the same-ish way (all part of a RAID-like group, or all mirrored, and so on).

Example: Root pool, containing two virtual devices, each composed of several disks.
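
A hedged command-line sketch of building pools like these; the pool name tank and all device paths are placeholders, and each zpool create line is an alternative layout rather than a sequence:

    # Pool made of two mirrored disk groups (vdevs)
    zpool create tank mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde

    # Single-, double-, or triple-parity groups instead
    zpool create tank raidz1 /dev/sdb /dev/sdc /dev/sdd
    zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    zpool create tank raidz3 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

    # Add a hot spare the system can substitute in automatically
    zpool add tank spare /dev/sdg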

Each disk is tagged with a vdev label (in quadruplicate: two at the start, and two at the end). This identifies which pool the disk is part of, as well as containing other important information: any virtual device it is part of, as well as the root (top-level) pool description, is described in this label. Again, each vdev label contains all the metadata relevant to its physical device and its parents in the pool hierarchy.
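
If you have the disk (or an image of it) in hand, the labels can be inspected directly with the ZFS debugger; a minimal sketch, assuming zdb is available and /dev/sdb stands in for the device of interest:

    # Dump the vdev labels: pool name/GUID, vdev tree, and related metadata
    # (on some versions, repeating -l prints additional label detail)
    zdb -l /dev/sdb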

Some OSes have various requirements here (e.g., Solaris requires the OS be on its own pool that must be built in mirrored mode, and user data can go elsewhere). Others don’t care so much, or don’t support bootable ZFS, or various other constraints.

The actual user-visible filesystems are created from the pool. On creation, the admin can specify whether they want the FS to default to particular types of storage (mirrored or not, etc.). This is useful if you want, say, user data to be on mirrored or parity drives, but you are OK with other stuff (say, large media files) not being stored redundantly. You can specify per-FS or per-user quotas on each FS, as well as reserving space from the pool for the FS, just as you’d expect.
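
A sketch of carving filesystems and space out of a pool; all names and sizes are placeholders:

    # Filesystems are cheap; make one per purpose
    zfs create tank/home
    zfs create tank/media

    # Quotas cap usage; reservations guarantee space out of the pool
    zfs set quota=50G tank/home
    zfs set reservation=10G tank/home

    # Per-user quota within a filesystem
    zfs set userquota@alice=5G tank/home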

ZFS supports hot-swapping of disks and hot failover, which is nice. It also supports adding new disks to a pool without shutting the pool down. When you split a disk off a mirrored pool, it becomes an exported pool that can then be imported on another system.
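
Some of those operations, sketched with placeholder pool and device names:

    # Grow the pool online by adding another mirrored disk group
    zpool add tank mirror /dev/sdf /dev/sdg

    # Swap out a failing disk without taking the pool down
    zpool replace tank /dev/sdb /dev/sdh

    # Split one side off a mirrored pool as a new, exported pool,
    # or export/import the whole pool to move it between systems
    zpool split tank tank2
    zpool export tank
    zpool import tank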

ZFS is a transactional filesystem, which means both metadata and data are written using copy-on-write. (Example: to change a file, the new content blocks are written first, then new copies of the metadata that point to them, and finally the uberblock is updated to commit the change.)

Synchronous writes are journaled in the ZIL (the ZFS Intent Log), which enforces sequencing.
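
As mentioned above, the ZIL can be put on fast media by giving the pool a dedicated log device; a sketch with placeholder device and dataset names:

    # Dedicated log device for the ZIL (alternative: a mirrored pair of log devices)
    zpool add tank log /dev/nvme0n1
    # zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

    # Force every write in a dataset through the synchronous (ZIL) path
    zfs set sync=always tank/db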

Copy-on-write also allows for virtually free clones and snapshots, with space usage growing only with the delta between the filesystems. (Similar to Time Machine or Shadow Copies.) Most ZFS-using OSes do regular snapshots as a result.
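
A sketch of the snapshot/clone machinery; dataset, snapshot, and host names are placeholders:

    # Take and list snapshots; they are read-only and initially consume almost no space
    zfs snapshot tank/home@before-upgrade
    zfs list -t snapshot

    # Roll back, or fork a writable clone off the snapshot
    zfs rollback tank/home@before-upgrade
    zfs clone tank/home@before-upgrade tank/home-experiment

    # Stream a snapshot to another pool or machine
    zfs send tank/home@before-upgrade | ssh otherhost zfs receive backup/home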

ZFS uses 128-bit pointers for most things, and so has effectively no limit on filesystem size or the number of files allocatable.

ZFS checksums everything. Both ZFS metadata and file content data are checksummed (with a cryptographic hash, if configured that way), and checksums are re-validated on every disk read. This lets the FS detect corruption immediately, and it also avoids problems with certain RAID and mirroring configurations: with only a HW RAID mirror, you can detect corruption (the copies disagree) but not repair it, whereas with checksums you know which copy of the data is valid! The same story holds for parity drives, especially if a failure occurs partway through writing data and parity when you have only one parity drive (aka the “write hole” problem). And checksums are stored in a different block / sector (the parent block pointer) than the data being checksummed.
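
The checksum algorithm is a per-filesystem property, and the pool can be told to walk and re-verify everything on demand; a sketch, with tank and tank/home as placeholders:

    # The default is a fast non-cryptographic checksum (fletcher4); stronger hashes are available
    zfs set checksum=sha256 tank/home

    # Re-read every allocated block, verify checksums, and repair from good copies
    zpool scrub tank
    zpool status -v tank   # reports any checksum errors found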

ZFS also stores most things redundantly (the extra copies are called ditto blocks). By default, user data has one copy, filesystem metadata has two, and pool-wide metadata has three. And the copies will end up on separate physical devices if possible. But this is configurable: you can increase redundancy if you want.
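
The per-filesystem knob for extra data copies, as a sketch with placeholder names:

    # Keep two copies of user data in this filesystem (even on a single-disk pool)
    zfs set copies=2 tank/home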

ZFS: Some details

Let’s talk a bit about what lives on disk for ZFS, and how that goes from physical disks to filesystems. We’ll start with the vdev label. Remember, it lives (in quadruplicate) on each physical disk. Just like every other FS, there has to be something that lives in a place you can find (here, it’s the vdev label), and then, from that, a way to find everything else.

The vdev label has some reserved space at the front, along with a description of the physical / logical vdevs it’s part of. Then it has an array of uberblocks; these are the highest-level data structure in ZFS. And that’s it. Other than the copies of the vdev labels, the rest of the disk is allocatable storage space for ZFS. Only one uberblock is active at a time (they carry transaction numbers; the highest valid one wins), and they are updated in rotation, one-by-one, as FS changes occur. The active uberblock lets you find everything else in that pool.
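
On a live (imported) pool, the active uberblock can be inspected with the debugger; a minimal sketch assuming a pool named tank:

    # Display the current (active) uberblock for the pool
    zdb -u tank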

In ZFS, everything is an object, and objects are represented by dnodes (similar to inodes in UFS/ExtFS). dnodes contain block pointers, which record where the object’s blocks live on disk and how big they are. dnodes are usually stored in arrays.

Objects of similar types are grouped together into object sets. Each object set is described by a small structure containing a metadnode (a dnode that points to the dnode array holding the set’s objects), a ZIL header, and the type of objects stored in the set. There is a special object set called the MOS (Meta Object Set) that is the superset of all objects in the pool.

So, to find a particular filesystem (work through top of diagram on page S103)

Once you find the filesystem, you can follow pointers through the filesystem’s objects in object sets / dnodes, similar to following inodes and block pointers in other FSs, to find a particular path, its metadata, and its content on disk. (work through bottom of diagram)
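
The same walk can be done by hand with the debugger; a sketch where tank/home is a placeholder dataset and the object number is made up:

    # Summarize the objects in a dataset
    zdb -d tank/home

    # Dump one object's dnode in detail, down to its block pointers
    zdb -dddd tank/home 8   # "8" is a placeholder object number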

File content is stored in extents / FSBs (filesystem blocks), which are variable-sized!
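
The maximum FSB (“record”) size is itself a per-filesystem property; a sketch with placeholder names:

    # Inspect or change the maximum block ("record") size for a filesystem
    zfs get recordsize tank/home
    zfs set recordsize=1M tank/home   # large records suit big sequential files; 1M needs a recent ZFS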

ZFS: Forensic implications

COW is at the FSB level, so old versions of overwritten blocks hang around in unallocated space until that space is reused.

Extra copies of (meta)data live in allocated space (ditto blocks), and allocation is roughly first fit, so duplicate and stale copies can turn up in unexpected places.

Snapshots / clones preserve older versions of files and metadata, often going back a long way if the OS snapshots regularly.

The ZIL may hold copies of recent synchronous writes.

Compression (of both metadata and data) defeats naive keyword searching and carving; see the property checks sketched after this list.

Dynamically-sized extents mean carving can’t assume a single fixed block size.
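
Putting a few of these together, one hedged sketch of read-only triage on an evidence pool (all pool, dataset, and device names are placeholders):

    # Import read-only so nothing on the evidence media is modified
    zpool import -o readonly=on tank

    # Enumerate filesystems, clones, and snapshots (old file versions live in snapshots)
    zfs list -r -t all tank

    # Check the properties that affect carving and keyword searches
    zfs get compression,compressratio,checksum,copies,recordsize tank/home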