13: Introduction to NTFS

Announcements

Reminder: A08 due Friday. Gradescope item will be up soon.

Prof. Brian Levine will be lecturing on Thursday.

NTFS: An overview

NTFS is the default filesystem on all modern versions of Windows; thus, like FAT, it's in wide use. Unlike FAT, it's the primary filesystem on most home and business computers.

In some ways NTFS is simpler than FAT: it was designed from the ground up to be extensible, and so its design is more principled (and free of the legacy cruft that encumbers, say, the FAT boot sector and its FAT12/16/32 nonsense).

That's the good news. The bad news is that the extensibility is not theoretical. Generic on-disk data structures wrap more specific data structures, so that the internals can be updated over time. To understand NTFS, we'll need to cover both the generic and specific data structures (though not all of them), and that means there are a lot of details to keep straight in your head. I'll do my best in lecture, but you are almost certainly going to need to read and re-read Carrier as well to understand this material.

One core concept in NTFS is simple: Everything is (or is stored in) a file. Regular files, directories, the structures that control the filesystem's layout on disk (like the FAT from FAT16) -- all are either files or stored in files. There's no separate plane of existence for filesystem metadata (like inodes in a UNIX-y filesystem or the FAT + directory entries in FAT16).

The next core concept in NTFS is that the above is a bit of a cheat. Most (but not all) of what we think of as filesystem metadata is stored in one particular data structure: the Master File Table (MFT), which is stored as a file but is otherwise very much analogous to the FATs and dirents, as we'll see. We're going to spend a lot of time today and next class talking about the MFT and how it relates to the files stored on disk.

The final high-level thing you need to know about NTFS is that it breaks a disk up into allocatable units called clusters, just like FAT. Just like FAT, clusters are sized as a power-of-two-multiple of the underlying disk sector size. Unlike FAT, though, cluster 0 starts at the beginning of the partition, so there's none of the "first cluster is cluster number 2" nonsense to contend with.

Finding the MFT

The Master File Table (MFT) contains information about all files and directories in its NTFS volume. Each file/directory has an MFT entry; the table is just a linear array of MFT entries, each identified by a file number, starting from 0.

How do we find the MFT? Just like in FAT, the first sector of the volume contains a boot sector, and that boot sector encodes the minimal information necessary to understand and parse the volume, including the bytes per sector, sectors per cluster, cluster address of the MFT, and the MFT entry size.

Let's look at an example. Download simple.ntfs and follow along with Table 13.18 on page 380 in Carrier.

The bytes per sector are stored in bytes 11--12. Here it's 512.

The sectors per cluster are stored in byte 13. Here it's 8, so clusters are 8 * 512 B = 4 KB each.

The cluster address of the MFT is stored in bytes 48--55. Here, it's 4.

The size of the file record is at byte 64. It (and the size of the index record, at byte 68) is stored in a special format. If, when interpreted as a signed byte, it's positive, then it's the number of clusters used for that record. If it's negative, then 2^(abs(value)) bytes are used. Here it's -10, which means that file records are 2^10 B = 1 KB each (this is the default value).

Compare this with index records, the size of which is stored in byte 68. Here, it's 1, which means index records are 4 KB (one cluster) long.
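
Here's a minimal sketch of that size rule in Python. The function name decode_record_size is my own, and treating a stored value of zero as an error is an assumption (the rule above only covers positive and negative values).

```python
import struct

def decode_record_size(raw_byte: int, bytes_per_cluster: int) -> int:
    """Return a record size in bytes from the one-byte boot sector encoding."""
    # Reinterpret the unsigned byte (0..255) as a signed byte (-128..127).
    (value,) = struct.unpack("<b", bytes([raw_byte]))
    if value > 0:
        # Positive: the value is a count of clusters.
        return value * bytes_per_cluster
    if value < 0:
        # Negative: the record is 2^(abs(value)) bytes long.
        return 2 ** abs(value)
    # Assumption: zero is not a meaningful size, so reject it.
    raise ValueError("record size byte is zero")

file_record_size = decode_record_size(0xF6, 4096)   # 0xF6 is -10 as a signed byte
index_record_size = decode_record_size(0x01, 4096)  # one cluster
print(file_record_size, index_record_size)
```

The two calls use the values from simple.ntfs: -10 for file records and 1 for index records, with 4 KB clusters.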

So now we can go and find the start of the MFT in the volume. It's at cluster 4. Cluster 4 is 4 * 4,096 bytes into the file, at offset 0x4000.
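
The same lookup can be sketched in Python. This is an illustration only: parse_boot_sector and the dictionary keys are names I made up, and the "boot sector" here is a synthetic 512-byte buffer filled with the values we just read from simple.ntfs, not the real image.

```python
import struct

def parse_boot_sector(sector: bytes) -> dict:
    """Pull out the fields needed to locate the MFT (offsets per Carrier)."""
    bytes_per_sector = struct.unpack_from("<H", sector, 11)[0]  # bytes 11--12
    sectors_per_cluster = sector[13]                            # byte 13
    mft_cluster = struct.unpack_from("<Q", sector, 48)[0]       # bytes 48--55
    bytes_per_cluster = bytes_per_sector * sectors_per_cluster
    return {
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "bytes_per_cluster": bytes_per_cluster,
        "mft_cluster": mft_cluster,
        # Cluster 0 starts at the beginning of the volume, so this is a
        # plain multiplication with no "first cluster is 2" adjustment.
        "mft_offset": mft_cluster * bytes_per_cluster,
    }

# Synthetic boot sector holding the values from the walkthrough.
sector = bytearray(512)
struct.pack_into("<H", sector, 11, 512)
sector[13] = 8
struct.pack_into("<Q", sector, 48, 4)

info = parse_boot_sector(bytes(sector))
print(hex(info["mft_offset"]))
```

To try it on the real image, read the first 512 bytes of simple.ntfs in binary mode and pass them in instead of the synthetic buffer.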

Let's double check against the output of fsstat to see that we're doing this correctly.

fsstat simple.ntfs
# ... output follows ...

What's in the MFT?

The MFT is just a sequence of entries. The first 16 are reserved by MS for filesystem metadata information, but in practice it's the first 24 that are reserved. Table 11.1 shows the contents of the reserved entries.

Entry 0 is an entry for the MFT itself. We need this, because although the boot sector tells us where the MFT starts, it (the MFT) might run across multiple clusters. This entry tells us where to find the rest of the MFT!

Entry 3 is the $Volume information; entry 6 is for the $Bitmap (similar to the FAT, but it only tracks allocation, not runs); entry 7 is for the $Boot sector, and so on.

We're going to look at one shortly. But before we do, let's talk a little about the general structure of an MFT entry.

It starts with an MFT entry header, described in detail in Table 13.1. Then there's a sequence of attribute (header, content) pairs, with (usually) some unused space at the end of the entry.

The attribute header identifies the attribute type, size, and name, among other things.

The attribute contents can have any format and any size: one obvious use is to store the contents of a file corresponding to the entry. Small attribute contents can fit in the MFT entry (one systems consequence is that small enough -- roughly, under 700 B -- files don't automatically waste tons of space, as they do in FAT). These are called resident attributes generally, whether they store files or just other attribute content.

Larger attributes (again, might be files, might be other things) might not fit in the entry; these are called non-resident. Non-resident attributes are stored in clusters. The clusters are identified by runlists. Runlists are just lists of runs of contiguous clusters that hold the file. See Figure 11.6.

We know that the MFT is at 16K into this volume; let's use some UNIXy tools to pull out the first entry so that we can see offsets from zero in this entry:

dd if=simple.ntfs of=zeroth-mft-entry bs=1024 count=1 skip=16

...and take a look at it. See Table 13.1 on page 353.

It starts with the four byte sequence corresponding to ASCII "FILE" (or "BAAD" if there's an error on the disk in this entry). Let's find the first attribute. Its offset (from the start of the entry) is stored in bytes 20--21. Here, its value is 56.
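
In Python, that check might look like the sketch below. The function name first_attribute_offset is hypothetical, and the entry is a synthetic 1 KB buffer that only has the signature and the offset field filled in.

```python
import struct

def first_attribute_offset(entry: bytes) -> int:
    """Validate the MFT entry signature and return the first attribute's offset."""
    signature = entry[0:4]
    if signature == b"BAAD":
        raise ValueError("entry is marked as having an on-disk error")
    if signature != b"FILE":
        raise ValueError(f"unexpected signature: {signature!r}")
    # Bytes 20--21: little-endian offset from the start of the entry.
    return struct.unpack_from("<H", entry, 20)[0]

# Synthetic entry: just enough of the header for this example.
entry = bytearray(1024)
entry[0:4] = b"FILE"
struct.pack_into("<H", entry, 20, 56)

offset = first_attribute_offset(bytes(entry))
print(offset)
```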

Attributes: headers and contents

Let's skip to the attribute, which starts with a header in a standard format. See Tables 13.2--13.4. The first 16 bytes are the same in resident and non-resident attribute headers; after that they diverge.

The header starts with a four-byte type tag. Here, it's 16, which is a $STANDARD_INFORMATION header. This and $FILE_NAME (48) are two attributes that nearly every entry will have.

The next four bytes (4--7) tell us the length. Here it's 96; this means the next attribute starts at offset 56 + 96 = 152 from the start of the current MFT entry.

Byte 8 (offset from 56, remember, so byte 64 in the dump) tells us if the attribute is non-resident. Here it's zero, so this attribute is resident -- that is, it's embedded in the MFT entry.

(Discussion names for standard attributes? E.g., as ADS?)

Let's jump ahead to bytes 16--19 (the size, here 72) and bytes 20--21 (the offset, here 24) of this attribute. (Sanity check: 72 + 24 = 96, which is the size of the attribute.)
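
As a rough sketch, here is how you might slice the content out of a resident attribute in Python. The name resident_content is mine, and the attribute below is synthetic: it reuses the sizes from this example (length 96, content size 72, content offset 24) with filler bytes as content.

```python
import struct

def resident_content(attribute: bytes) -> bytes:
    """Return the content of a resident attribute (offsets from attribute start)."""
    if attribute[8] != 0:
        raise ValueError("attribute is non-resident; content lives in clusters")
    content_size = struct.unpack_from("<I", attribute, 16)[0]    # bytes 16--19
    content_offset = struct.unpack_from("<H", attribute, 20)[0]  # bytes 20--21
    return attribute[content_offset:content_offset + content_size]

# Synthetic $STANDARD_INFORMATION-shaped attribute with filler content.
attribute = bytearray(96)
struct.pack_into("<II", attribute, 0, 16, 96)  # type 16, length 96
attribute[8] = 0                               # resident
struct.pack_into("<I", attribute, 16, 72)      # content size
struct.pack_into("<H", attribute, 20, 24)      # content offset
attribute[24:96] = bytes(range(72))            # filler, not real timestamps

content = resident_content(bytes(attribute))
print(len(content))
```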

So you get the idea; Tables 13.5 and 13.6 tell you how to parse the $STANDARD_INFORMATION attribute's contents.

Let's look at the next attribute. We find it by skipping past the previous one. The previous one started at offset 56 and had a length of 96, so we need to skip ahead to byte 56 + 96 = 152 of the entry.
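
This "skip by the length" step generalizes to a loop over all the attributes in an entry. The sketch below is one way to write it. Two things in it are not from these notes: the name iter_attributes, and the end-of-list marker 0xFFFFFFFF, which NTFS stores where the next type tag would otherwise go. The entry is synthetic, laid out to mimic the offsets and lengths in this lecture's example.

```python
import struct

END_MARKER = 0xFFFFFFFF  # type value that terminates the attribute list

def iter_attributes(entry: bytes):
    """Yield (offset, type, length, non_resident) for each attribute in an entry."""
    offset = struct.unpack_from("<H", entry, 20)[0]  # first attribute offset
    while offset + 16 <= len(entry):                 # 16 bytes of common header
        attr_type, length = struct.unpack_from("<II", entry, offset)
        if attr_type == END_MARKER or length == 0:
            break
        non_resident = entry[offset + 8] != 0
        yield offset, attr_type, length, non_resident
        offset += length  # the next attribute starts right after this one

# Synthetic entry: headers only, no attribute contents.
entry = bytearray(1024)
entry[0:4] = b"FILE"
struct.pack_into("<H", entry, 20, 56)
struct.pack_into("<II", entry, 56, 16, 96)     # $STANDARD_INFORMATION
struct.pack_into("<II", entry, 152, 48, 104)   # $FILE_NAME
struct.pack_into("<II", entry, 256, 128, 72)   # $DATA
entry[256 + 8] = 1                             # $DATA is non-resident
struct.pack_into("<I", entry, 328, END_MARKER)

attributes = list(iter_attributes(bytes(entry)))
for attribute in attributes:
    print(attribute)
```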

This one has type 48 ($FILE_NAME). Let's dig into this attribute. We see it's 104 bytes long, so let's pull it out using dd so we can see offsets from 0:

dd if=zeroth-mft-entry of=file_name_attribute bs=1 skip=152 count=104

The first four bytes tell us the type (48); the next four are the length (104 bytes). This attribute is also resident (byte 8 is zero). Let's skip ahead to bytes 20--21, which are the offset (from the start of the attribute) to the content. Here, it's 24, so let's go there.

The $FILE_NAME attribute is described in Table 13.7 on page 362. I'm going to use Hex Fiend to do the extraction here, but it's the same as using dd previously, or as slicing a sequence of bytes or seek()ing in Python.

The first eight bytes are the file reference of the parent directory. File references are composed of two parts: the file number and the sequence number. The file number we've already seen: it's the index into the MFT to get to this entry, starting from zero. The sequence number is incremented each time an MFT entry is allocated for use. The two numbers are combined into a 64-bit file reference, with the 16-bit sequence number in the high-order bits and the 48-bit file number in the low-order bits. Note that like all values, it's stored little endian, so the bytes on disk are laid out as:

FF FF FF FF FF FF SS SS

where FF are file number bytes and SS are sequence number bytes.
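
A short sketch of pulling the two parts apart in Python. The function name split_file_reference is my own, and the eight bytes in the example are made up for illustration (file number 5, sequence number 5), not copied from simple.ntfs.

```python
import struct

def split_file_reference(raw: bytes):
    """Split an 8-byte on-disk file reference into (file number, sequence number)."""
    (reference,) = struct.unpack("<Q", raw)
    file_number = reference & 0xFFFFFFFFFFFF  # low 48 bits
    sequence_number = reference >> 48         # high 16 bits
    return file_number, sequence_number

# Hypothetical reference bytes in on-disk order: FF FF FF FF FF FF SS SS.
raw = bytes.fromhex("05 00 00 00 00 00 05 00")
file_number, sequence_number = split_file_reference(raw)
print(file_number, sequence_number)
```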

So in this $FILE_NAME, the file number is 5. 5 is one of the reserved slots in the MFT, which if we look up in Table 11.1, we see is the root directory.

The next sequence of 8 bytes is the file creation time. NTFS stores file-related times as the number of blocks of 100 ns since January 1st, 1601 UTC.

Here, the value is 8064E1E7 8BA1D201 so we can get the number in Python using:

import struct
timestamp = struct.unpack('<Q', bytes.fromhex('8064E1E7 8BA1D201'))[0]

To convert that to a time, we convert to a UNIX-style epoch (the number of seconds since January 1, 1970). The UNIX epoch itself falls 116444736000000000 blocks of 100 ns after January 1st, 1601 UTC. So to convert, we subtract that constant and divide by 10,000,000 (the number of 100 ns blocks in a second):

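import datetime
def as_datetime(windows_timestamp):
    # Shift to the UNIX epoch, then convert 100 ns blocks to seconds.
    unix_seconds = (windows_timestamp - 116444736000000000) / 10000000
    # Ask for UTC explicitly; the default would be the local timezone.
    return datetime.datetime.fromtimestamp(unix_seconds, tz=datetime.timezone.utc)
print(str(as_datetime(timestamp)))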
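
A variant of the conversion above, if you'd rather not remember the magic constant: start from the 1601 epoch directly and add a timedelta. This is just a sketch; NTFS_EPOCH and filetime_to_utc are names I made up. It stays in integer arithmetic and truncates to whole microseconds, since that's the finest resolution a Python datetime holds.

```python
import datetime
import struct

NTFS_EPOCH = datetime.datetime(1601, 1, 1, tzinfo=datetime.timezone.utc)

def filetime_to_utc(windows_timestamp: int) -> datetime.datetime:
    """Convert a count of 100 ns blocks since 1601 into a UTC datetime."""
    # Ten 100 ns blocks make one microsecond.
    return NTFS_EPOCH + datetime.timedelta(microseconds=windows_timestamp // 10)

raw = bytes.fromhex("8064E1E7 8BA1D201")
(timestamp,) = struct.unpack("<Q", raw)
created = filetime_to_utc(timestamp)
print(created.isoformat())
```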

Bytes 16--23 are the modification time; 24--31 are the MFT modification time; 32--39 are the last access time.

Bytes 40--47 are the allocated file size and bytes 48--55 are the actual size, but these are not required to be accurate unless this attribute is used in a directory index.

Bytes 56--59 are flags, just like in FAT (see Table 13.6).

Byte 64 is the length of the filename (4). Byte 65 is the namespace (see Table 13.8; here it's the Windows/DOS namespace). Bytes 66 onward are the name, in this case in UTF-16 (LE): $MFT. This is the $MFT entry, just like we expected for entry 0.
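
Putting the $FILE_NAME fields together, a parser might look like the sketch below. The name parse_file_name and the dictionary keys are mine. It assumes the length at byte 64 counts UTF-16 characters (two bytes each) rather than bytes. The content buffer is synthetic: timestamps, sizes, and flags are left as zeros, and the namespace value 3 stands for the Windows/DOS namespace.

```python
import struct

def parse_file_name(content: bytes) -> dict:
    """Parse a few fields from $FILE_NAME attribute content."""
    (parent_reference,) = struct.unpack_from("<Q", content, 0)  # bytes 0--7
    name_length = content[64]  # in UTF-16 characters, not bytes
    namespace = content[65]
    name = content[66:66 + 2 * name_length].decode("utf-16-le")
    return {
        "parent_file_number": parent_reference & 0xFFFFFFFFFFFF,
        "parent_sequence": parent_reference >> 48,
        "namespace": namespace,
        "name": name,
    }

# Synthetic content: parent is file number 5, name is "$MFT".
content = bytearray(66 + 8)
struct.pack_into("<Q", content, 0, 5)
content[64] = 4
content[65] = 3
content[66:74] = "$MFT".encode("utf-16-le")

parsed = parse_file_name(bytes(content))
print(parsed)
```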

Note we can check all this using istat on the relevant entry:

istat simple.ntfs 0-128-1

One last attribute to look at here, the $DATA attribute, which is next. Going back to the zeroth-mft-entry, it starts at offset 256. The first four bytes (value: 128) tell us it's the $DATA attribute, and the next four tell us its length (72). The next byte tells us that this is a non-resident attribute, so its contents are stored somewhere in cluster(s) on the disk.

Let's jump ahead to figure out where (that is, which clusters) the data is stored in.

Looking at Table 13.4, bytes 16--23 and 24--31 are used to tell us the starting and ending VCN of the runlist. A VCN (virtual cluster number) is just a position in the sequence 0..n-1 referring to the n clusters in a file in order. This is in contrast to an LCN (logical cluster number), which is the actual cluster number (on disk) that holds a given VCN. (on board)
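
To make the VCN/LCN distinction concrete, here's a small sketch that maps a VCN to an LCN given a list of runs. Everything in it is illustrative: the function name vcn_to_lcn, the (starting LCN, length) tuple format, and the two runs themselves are hypothetical.

```python
def vcn_to_lcn(runs, vcn):
    """Map a file-relative cluster index (VCN) to an on-disk cluster (LCN).

    runs is a list of (starting_lcn, length_in_clusters) in file order.
    """
    first_vcn = 0  # VCN of the first cluster in the current run
    for start_lcn, length in runs:
        if first_vcn <= vcn < first_vcn + length:
            return start_lcn + (vcn - first_vcn)
        first_vcn += length
    raise ValueError("VCN is past the end of the runs")

# Hypothetical fragmented file: 19 clusters at LCN 4, then 3 at LCN 100.
runs = [(4, 19), (100, 3)]
print(vcn_to_lcn(runs, 0), vcn_to_lcn(runs, 18), vcn_to_lcn(runs, 19))
```

VCNs 0..18 land in the first run and VCNs 19..21 land in the second, even though the two runs are nowhere near each other on disk.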

Why does the non-resident header have this marker? Because for very long, fragmented files, you might not be able to fit the runlist into a single MFT entry and might need to split it into several and want to be able to parse them independently.

Then there's an offset to the runlist at bytes 32--33. The runlist is in the following format. First there's a single byte that describes the sizes of the length and offset fields of the next run; then there's a variable number of bytes describing the length of the run, and a variable number of bytes describing the offset to the run.

The first byte is split into two nibbles (4-bit values). The low-order bits tell you the number of bytes in the run length; the high-order bits tell you the number of bytes in the offset to the run.

These values stored in the length and offset are in units of clusters, not bytes or sectors.

And, while the length is what you'd expect, the offset is relative to the previous offset in the runlist (the first offset is relative to the start of the filesystem). Let's look at some data:

111304

So, the first byte contains 0x11, which in binary is

0001 0001

This run is described by a one-byte length and a one-byte offset. The length comes first: 0x13, so it's 19 clusters long. The offset comes next: 0x04, so it starts 4 clusters past the start of the file system. Which is what we expect, again as shown by istat.
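
Here is a minimal runlist decoder in Python, as a sketch rather than a complete implementation. The name decode_runlist is mine. It relies on two details beyond what's above: the runlist ends at a header byte of 0x00, and offsets are signed (a run can sit before the previous one on disk). It does not handle sparse runs, where the offset field is absent.

```python
def decode_runlist(data: bytes):
    """Decode an NTFS runlist into a list of (starting_lcn, length_in_clusters)."""
    runs = []
    position = 0
    current_lcn = 0  # offsets are relative to the previous run's start
    while position < len(data) and data[position] != 0:
        header = data[position]
        length_size = header & 0x0F  # low nibble: bytes in the length field
        offset_size = header >> 4    # high nibble: bytes in the offset field
        position += 1
        length = int.from_bytes(data[position:position + length_size], "little")
        position += length_size
        offset = int.from_bytes(
            data[position:position + offset_size], "little", signed=True
        )
        position += offset_size
        current_lcn += offset
        runs.append((current_lcn, length))
    return runs

runs = decode_runlist(bytes.fromhex("11 13 04 00"))
print(runs)
```

The input is the 11 13 04 from above plus the terminating zero byte, so the result is a single run of 19 clusters starting at cluster 4.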

More next class.