13: Introduction to NTFS

DRAFT

Announcements

Midterm almost graded.

A08 due Friday, lots of good questions (and less good ones) on Piazza and in office hours. A09 due next week on Wednesday.

NTFS: An overview

NTFS is the default filesystem on all modern versions of Windows; thus, like FAT, it’s in wide use. Unlike FAT, it’s in use as the filesystem that most home and business computers run on.

In some ways NTFS is simpler than FAT: it was designed from the ground up to be extensible, and so its design is more principled (and free of the legacy cruft that encumbers, say, the FAT boot sector and its FAT12/16/32 nonsense).

That’s the good news. The bad news is that the extensibility is not theoretical. Generic on-disk data structures wrap more specific data structures, so that the internals can be updated over time. To understand NTFS, we’ll need to cover both the generic and specific data structures (though not all of them), and that means there are a lot of details to keep straight in your head. I’ll do my best in lecture, but you are almost certainly going to need to read and re-read Carrier as well to understand this material.

One core concept in NTFS is simple: Everything is (or is stored in) a file. Regular files, directories, the structures that control the filesystem’s layout on disk (like the FAT from FAT16) – all are either files or stored in files. There’s no separate plane of existence for filesystem metadata (like inodes in a UNIX-y filesystem or the FAT + directory entries in FAT16). Certain special files have special names (for example, what we think of as the “boot sector” from FAT is a file called $Boot in NTFS), but they’re all just considered files by the file system.

The next core concept in NTFS is that the above is a bit of a cheat. Most (but not all) of what we think of as filesystem metadata is stored in one particular data structure: the Master File Table (MFT), which is stored as a file (named $MFT of course) but contents-wise is analogous to the FATs and dirents, as we’ll see. We’re going to spend a lot of time today and next week talking about the MFT and how it relates to the files stored on disk.

The final high-level thing you need to know about NTFS is that it breaks a disk up into allocatable units called clusters, just like FAT. Just like FAT, clusters are sized as a power-of-two-multiple of the underlying disk sector size. Unlike FAT, though, cluster 0 starts at the beginning of the partition, so there’s none of the “first cluster is cluster number 2” nonsense to contend with.

Finding the MFT

The Master File Table (MFT) contains information about all files and directories in its NTFS volume. Each file/directory has an MFT entry; the table is just a linear array of MFT entries, numbered with a file number, starting from 0.

How do we find the MFT? Just like in FAT, the first sector of the volume contains a boot sector (which again, in NTFS is just a file, named $Boot). That boot sector encodes the minimal information necessary to understand and parse the volume, including the bytes per sector, sectors per cluster, cluster address of the MFT, and the MFT entry size. Once you can find the MFT, you need to go there to learn the rest of what you need to know about the volume.

Let’s look at an example. Download simple.ntfs and follow along with Table 13.18 on page 380 in Carrier.

(Note we almost certainly won’t get through all of this today.)

00000000  eb 52 90 4e 54 46 53 20  20 20 20 00 02 08 00 00  |.R.NTFS    .....|
00000010  00 00 00 00 00 f8 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 80 00 80 00  ff 4f 00 00 00 00 00 00  |.........O......|
00000030  04 00 00 00 00 00 00 00  ff 04 00 00 00 00 00 00  |................|
00000040  f6 00 00 00 01 00 00 00  a4 a4 a1 72 46 d9 dc 42  |...........rF..B|
00000050  00 00 00 00 fa 33 c0 8e  d0 bc 00 7c fb 68 c0 07  |.....3.....|.h..|

The bytes per sector are stored in bytes 11–12. Here it’s 512.

The sectors per cluster are stored in byte 13. Here it’s 8, so each cluster is 8 * 512 B = 4 KB.

The cluster address of the MFT is stored in bytes 48–55. Here, it’s 4.

The size of the file record is at byte 64. It (and the size of the index record, at byte 68) is stored in a special format. If, when interpreted as a signed byte, it’s positive, then it’s the number of clusters used for that record. If it’s negative, then 2^(abs(value)) bytes are used. Here it’s -10, which means that file records are 1 KB each (this is the default value).

Compare this with index records, whose size is stored in byte 68. Here, it’s 1, which means index records are 4 KB (one cluster) long.
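The signed-byte encoding above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name record_size_bytes is made up for this example, and the cluster size is the 4 KB value computed from this volume's boot sector.

```python
import struct

def record_size_bytes(raw_byte, cluster_size):
    """Decode the record-size field found at byte 64 or 68 of the boot sector.

    raw_byte is the unsigned byte value as read from disk (0..255).
    """
    # Reinterpret the byte as a signed 8-bit integer.
    (value,) = struct.unpack("<b", bytes([raw_byte]))
    if value < 0:
        # Negative: the record is 2^abs(value) bytes long.
        return 2 ** abs(value)
    # Positive: the value is a count of clusters.
    return value * cluster_size

CLUSTER_SIZE = 8 * 512  # sectors per cluster * bytes per sector

print(record_size_bytes(0xF6, CLUSTER_SIZE))  # byte 64: file record size
print(record_size_bytes(0x01, CLUSTER_SIZE))  # byte 68: index record size
```

0xF6 is -10 as a signed byte, so the first call gives 1024; the second gives one cluster, 4096.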

So now we can go and find the start of the MFT in the volume. It’s at cluster 4. Cluster 4 is 4 * 4,096 bytes into the file, at offset 0x4000.
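Here is one way to pull those fields out of the boot sector in Python and compute the MFT's byte offset. It's a sketch under the assumption that the bytes below were copied correctly from the hex dump above; parse_boot_sector and the dictionary keys are names invented for this example.

```python
import struct

# First 0x50 bytes of simple.ntfs, copied from the hex dump above.
BOOT = bytes.fromhex(
    "eb 52 90 4e 54 46 53 20 20 20 20 00 02 08 00 00"
    "00 00 00 00 00 f8 00 00 00 00 00 00 00 00 00 00"
    "00 00 00 00 80 00 80 00 ff 4f 00 00 00 00 00 00"
    "04 00 00 00 00 00 00 00 ff 04 00 00 00 00 00 00"
    "f6 00 00 00 01 00 00 00 a4 a4 a1 72 46 d9 dc 42"
)

def parse_boot_sector(boot):
    """Extract the fields needed to locate the MFT (all little endian)."""
    (bytes_per_sector,) = struct.unpack_from("<H", boot, 11)  # bytes 11-12
    sectors_per_cluster = boot[13]                             # byte 13
    (mft_cluster,) = struct.unpack_from("<Q", boot, 48)       # bytes 48-55
    cluster_size = bytes_per_sector * sectors_per_cluster
    return {
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "cluster_size": cluster_size,
        "mft_cluster": mft_cluster,
        # Cluster 0 starts at the beginning of the volume, so no adjustment.
        "mft_offset": mft_cluster * cluster_size,
    }

info = parse_boot_sector(BOOT)
print(info)
print(hex(info["mft_offset"]))
```

The computed offset should match the 0x4000 worked out by hand.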

Let’s double check against the output of fsstat to see that we’re doing this correctly.

fsstat simple.ntfs
# ... output follows ...

What’s in the MFT?

The MFT is just a sequence of entries. The first 16 are reserved by MS for filesystem metadata information, but in practice it’s the first 24 that are reserved. Table 11.1 shows the contents of the reserved entries.

Entry 0 is an entry for the MFT itself. We need this, because although the boot sector tells us where the MFT starts, it (the MFT) might run across multiple clusters. This entry tells us where to find the rest of the MFT!

Entry 3 is the $Volume information; entry 6 is for the $Bitmap (similar to the FAT, but it only tracks allocation, not runs); entry 7 is for the $Boot sector, and so on.

We’re going to look at one shortly. But before we do, let’s talk a little about the general structure of an MFT entry.

It starts with an MFT entry header, described in detail in Table 13.1. Then there’s a sequence of attribute (header, content) pairs, with (usually) some unused space at the end of the entry.

The attribute header identifies the attribute type, size, and name, among other things.

The attribute contents can have any format and any size: one perhaps obvious use is to store the contents of a file corresponding to the entry. Small attribute contents can fit in the MFT entry (one systems consequence is that small enough – roughly, under 700 B – files don’t automatically waste tons of space, as they do in FAT, since they don’t live in a cluster). These are called resident attributes generally, whether they store files or just other attribute content.

Larger attributes (again, might be files, might be other things) might not fit in the entry; these are called non-resident. Non-resident attributes are stored in clusters. The clusters are identified by runlists. Runlists are just lists of runs of contiguous clusters that hold the file. See Figure 11.6.

We know that the MFT is at 16K into this volume; let’s use some UNIXy tools to pull out the first entry so that we can see offsets from zero in this entry:

dd if=simple.ntfs of=zeroth-mft-entry bs=1024 count=1 skip=16

…and take a look at it.

00000000  46 49 4c 45 30 00 03 00  00 00 00 00 00 00 00 00  |FILE0...........|
00000010  01 00 01 00 38 00 01 00  98 01 00 00 00 04 00 00  |....8...........|
00000020  00 00 00 00 00 00 00 00  04 00 00 00 00 00 00 00  |................|
00000030  03 00 00 00 00 00 00 00  10 00 00 00 60 00 00 00  |............`...|
00000040  00 00 18 00 00 00 00 00  48 00 00 00 18 00 00 00  |........H.......|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000070  06 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  30 00 00 00 68 00 00 00  |........0...h...|
000000a0  00 00 18 00 00 00 02 00  4a 00 00 00 18 00 01 00  |........J.......|
000000b0  05 00 00 00 00 00 05 00  80 64 e1 e7 8b a1 d2 01  |.........d......|
000000c0  80 64 e1 e7 8b a1 d2 01  80 64 e1 e7 8b a1 d2 01  |.d.......d......|
000000d0  80 64 e1 e7 8b a1 d2 01  00 70 00 00 00 00 00 00  |.d.......p......|
000000e0  00 6c 00 00 00 00 00 00  06 00 00 00 00 00 00 00  |.l..............|
000000f0  04 03 24 00 4d 00 46 00  54 00 00 00 00 00 00 00  |..$.M.F.T.......|
00000100  80 00 00 00 48 00 00 00  01 00 40 00 00 00 01 00  |....H.....@.....|
00000110  00 00 00 00 00 00 00 00  12 00 00 00 00 00 00 00  |................|
00000120  40 00 00 00 00 00 00 00  00 30 01 00 00 00 00 00  |@........0......|
00000130  00 04 01 00 00 00 00 00  00 04 01 00 00 00 00 00  |................|
00000140  11 13 04 00 00 00 00 00  b0 00 00 00 48 00 00 00  |............H...|
00000150  01 00 40 00 00 00 03 00  00 00 00 00 00 00 00 00  |..@.............|
00000160  00 00 00 00 00 00 00 00  40 00 00 00 00 00 00 00  |........@.......|
00000170  00 10 00 00 00 00 00 00  10 00 00 00 00 00 00 00  |................|
00000180  10 00 00 00 00 00 00 00  11 01 02 00 00 00 00 00  |................|
00000190  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 03 00  |................|
00000200  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000003f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 03 00  |................|
00000400

See Table 13.1 on page 353.

It starts with the four byte sequence corresponding to ASCII "FILE" (or "BAAD" if there’s an error on the disk in this entry). Let’s find the first attribute – remember, all entries in the MFT consist of an MFT header, followed by attributes (which themselves consist of headers and contents). This first attribute’s offset (from the start of the entry) is stored in bytes 20–21. Here, its value is 56 (0x38).
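As a rough sketch, the same check can be done in Python on the first 56 bytes of the entry (copied from the dump above). The function name parse_entry_header is hypothetical, and the meaning of the flag bits at bytes 22–23 (0x01 = in use, 0x02 = directory) is my reading of the usual NTFS layout rather than something shown above, so double check it against Table 13.1.

```python
import struct

# Bytes 0x00-0x37 of the zeroth MFT entry, copied from the dump above.
ENTRY_HEADER = bytes.fromhex(
    "46 49 4c 45 30 00 03 00 00 00 00 00 00 00 00 00"
    "01 00 01 00 38 00 01 00 98 01 00 00 00 04 00 00"
    "00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00"
    "03 00 00 00 00 00 00 00"
)

def parse_entry_header(entry):
    """Read the signature and the offset of the first attribute."""
    signature = entry[0:4]
    if signature not in (b"FILE", b"BAAD"):
        raise ValueError("not an MFT entry: %r" % signature)
    (first_attribute_offset,) = struct.unpack_from("<H", entry, 20)  # bytes 20-21
    (flags,) = struct.unpack_from("<H", entry, 22)                   # bytes 22-23
    return {
        "signature": signature,
        "first_attribute_offset": first_attribute_offset,
        # Assumed flag meanings; see the note above this block.
        "in_use": bool(flags & 0x01),
        "is_directory": bool(flags & 0x02),
    }

header = parse_entry_header(ENTRY_HEADER)
print(header)
```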

Attributes: headers and contents

Let’s skip to the attribute, which starts with a header in a standard format. See Tables 13.2–13.4. The first 16 bytes are the same in resident and non-resident attribute headers; after that they diverge.

The header starts with a four-byte type tag. Here, it’s 16, which identifies a $STANDARD_INFORMATION attribute. This and $FILE_NAME (48) are two attributes that nearly every entry will have.

The next four bytes (4–7) tell us the length. Here it’s 96; this means the next attribute starts at offset 56 + 96 = 152 (that is, 96 bytes past the start of the current attribute).

Byte 8 (offset from 56, remember, so byte 64 in the dump) tells us if the attribute’s content is non-resident. Here it’s zero, so this attribute’s content is resident – that is, it’s embedded in the MFT entry.

(Discussion names for standard attributes? E.g., as ADS?)

Let’s jump ahead to bytes 16–19 (again: offset from 56, so go to 72, or 0x48) to get the size (72) and bytes 20–21 (offset 76, or 0x4c) to get the offset (24) of this attribute’s content. (Sanity check: 72 + 24 = 96, which is the size of the attribute in total and also the offset to the next attribute.)
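The resident attribute header fields walked through above can be read with a few struct calls. This is only a sketch: parse_resident_header and the TYPE_NAMES table are names made up for this example, and the 24 header bytes are copied from offsets 0x38–0x4f of the entry dump.

```python
import struct

# Type tags mentioned in these notes; the real list is longer.
TYPE_NAMES = {16: "$STANDARD_INFORMATION", 48: "$FILE_NAME",
              128: "$DATA", 176: "$BITMAP"}

# The 24-byte resident header of the first attribute (entry offset 56).
SI_HEADER = bytes.fromhex(
    "10 00 00 00 60 00 00 00 00 00 18 00 00 00 00 00"
    "48 00 00 00 18 00 00 00"
)

def parse_resident_header(attr):
    """Parse the header fields of a resident attribute."""
    type_id, length = struct.unpack_from("<II", attr, 0)     # bytes 0-3, 4-7
    non_resident = attr[8]                                   # byte 8
    if non_resident:
        raise ValueError("attribute is non-resident")
    (content_size,) = struct.unpack_from("<I", attr, 16)     # bytes 16-19
    (content_offset,) = struct.unpack_from("<H", attr, 20)   # bytes 20-21
    return {
        "type": type_id,
        "type_name": TYPE_NAMES.get(type_id, "unknown"),
        "length": length,
        "content_size": content_size,
        "content_offset": content_offset,
    }

si = parse_resident_header(SI_HEADER)
print(si)
# Same sanity check as in the text: header + content fill the attribute.
print(si["content_offset"] + si["content_size"] == si["length"])
```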

So you get the idea; Tables 13.5 and 13.6 tell you how to parse the $STANDARD_INFORMATION attribute’s contents.

Let’s look at the next attribute for this entry (remember, we’re still looking at the zeroth entry in the MFT). We find it by skipping past the previous one. The previous one started at offset 56 and had a length of 96, so we need to skip ahead to byte 56 + 96 = 152 (0x98) of the entry.

This one has type 48 ($FILE_NAME). Let’s dig into this attribute. We see it’s 104 (0x68) bytes long, so let’s pull it out using dd so we can see offsets from 0 (or just do it in Hex Fiend):

dd if=zeroth-mft-entry of=file_name_attribute bs=1 skip=152 count=104
00000000  30 00 00 00 68 00 00 00  00 00 18 00 00 00 02 00  |0...h...........|
00000010  4a 00 00 00 18 00 01 00  05 00 00 00 00 00 05 00  |J...............|
00000020  80 64 e1 e7 8b a1 d2 01  80 64 e1 e7 8b a1 d2 01  |.d.......d......|
*
00000040  00 70 00 00 00 00 00 00  00 6c 00 00 00 00 00 00  |.p.......l......|
00000050  06 00 00 00 00 00 00 00  04 03 24 00 4d 00 46 00  |..........$.M.F.|
00000060  54 00 00 00 00 00 00 00                           |T.......|

The first four bytes tell us the type (48); the next four are the length (104 bytes). This attribute is also resident. Let’s skip ahead to bytes 20–21, which are the offset (from the start of the attribute) to the content. Here, it’s 24, so let’s go there.

The $FILE_NAME attribute is described in Table 13.7 on page 362. I’m going to use Hex Fiend to do the extraction here, but it’s the same as using dd previously, or as slicing a sequence of bytes or seek()ing in Python.

dd if=file_name_attribute of=file_name_attribute_content bs=1 skip=24
00000000  05 00 00 00 00 00 05 00  80 64 e1 e7 8b a1 d2 01  |.........d......|
00000010  80 64 e1 e7 8b a1 d2 01  80 64 e1 e7 8b a1 d2 01  |.d.......d......|
00000020  80 64 e1 e7 8b a1 d2 01  00 70 00 00 00 00 00 00  |.d.......p......|
00000030  00 6c 00 00 00 00 00 00  06 00 00 00 00 00 00 00  |.l..............|
00000040  04 03 24 00 4d 00 46 00  54 00 00 00 00 00 00 00  |..$.M.F.T.......|

The first eight bytes are the file reference of the parent directory. File references are composed of two parts: the file number and the sequence number. The file number we’ve already seen: it’s the index into the MFT to get to this entry, starting from zero. The sequence number is incremented each time an MFT entry is allocated for use. The two numbers are concatenated, with the 16-bit sequence number in the higher-order bytes and the 48-bit file number in the lower-order bytes, to form a 64-bit file reference number. Note that like all values, it’s stored little endian, so the on-disk byte order is:

FN FN FN FN FN FN SS SS

where FN are file number bytes and SS are sequence number bytes.

So in this $FILE_NAME, the file number is 5. 5 is one of the reserved slots in the MFT, which if we look up in Table 11.1, we see is the root directory.
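Splitting a file reference into its two parts is a mask and a shift once the eight bytes are read as a little-endian integer. A small sketch, with split_file_reference as a made-up helper name:

```python
import struct

def split_file_reference(raw):
    """Split an 8-byte on-disk file reference into (file_number, sequence)."""
    (reference,) = struct.unpack("<Q", raw)
    file_number = reference & 0xFFFFFFFFFFFF   # low 48 bits
    sequence_number = reference >> 48          # high 16 bits
    return file_number, sequence_number

# Bytes 0-7 of the $FILE_NAME content shown above.
parent = split_file_reference(bytes.fromhex("05 00 00 00 00 00 05 00"))
print(parent)
```

For this entry both the file number and the sequence number of the parent come out as 5.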

The next sequence of 8 bytes is the file creation time. NTFS stores file-related times as the number of blocks of 100 ns since January 1st, 1601 UTC.

Here, the value is 8064E1E7 8BA1D201 so we can get the number in Python using:

import struct
timestamp = struct.unpack('<Q', bytes.fromhex('8064E1E7 8BA1D201'))[0]

To convert that to a time, we convert to a UNIX-style epoch (the number of seconds since January 1, 1970). The UNIX epoch falls 116444736000000000 100 ns blocks after January 1st, 1601 UTC, and there are 10,000,000 such blocks per second. So to convert, we can write:

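import datetime
def as_datetime(windows_timestamp):
    # Shift from the 1601 epoch to the 1970 epoch, then scale 100 ns blocks to seconds.
    unix_seconds = (windows_timestamp - 116444736000000000) / 10000000
    # The timestamp is in UTC; pass tz explicitly so the result isn't shifted to local time.
    return datetime.datetime.fromtimestamp(unix_seconds, tz=datetime.timezone.utc)
print(str(as_datetime(timestamp)))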

Bytes 16–23 are the modification time; 24–31 are the MFT modification time; 32–39 are the last access time.

Bytes 40–47 are the allocated file size and bytes 48–55 are the actual size, but these are not required to be accurate unless this attribute is used in a directory index.

Bytes 56–59 are flags, just like in FAT (see Table 13.6).

Byte 64 is the length of the filename in characters (4). Byte 65 is the namespace (see Table 13.8; here it’s the Windows/DOS namespace). Bytes 66 onward are the name, in this case in UTF-16 (LE): $MFT. This is the $MFT entry, just like we expected for entry 0.
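To tie the size, flag, and name fields together, here is a sketch that parses them out of the 80 content bytes shown earlier. parse_file_name_content is a hypothetical name, and the code assumes the name length counts two-byte UTF-16 units.

```python
import struct

# The $FILE_NAME content extracted above (80 bytes, including padding).
FN_CONTENT = bytes.fromhex(
    "05 00 00 00 00 00 05 00 80 64 e1 e7 8b a1 d2 01"
    "80 64 e1 e7 8b a1 d2 01 80 64 e1 e7 8b a1 d2 01"
    "80 64 e1 e7 8b a1 d2 01 00 70 00 00 00 00 00 00"
    "00 6c 00 00 00 00 00 00 06 00 00 00 00 00 00 00"
    "04 03 24 00 4d 00 46 00 54 00 00 00 00 00 00 00"
)

def parse_file_name_content(content):
    """Pull the sizes, flags, namespace, and name out of $FILE_NAME content."""
    allocated_size, actual_size = struct.unpack_from("<QQ", content, 40)
    (flags,) = struct.unpack_from("<I", content, 56)   # bytes 56-59
    name_length = content[64]                          # in UTF-16 units
    namespace = content[65]
    name_bytes = content[66:66 + 2 * name_length]      # 2 bytes per unit
    return {
        "allocated_size": allocated_size,
        "actual_size": actual_size,
        "flags": flags,
        "namespace": namespace,
        "name": name_bytes.decode("utf-16-le"),
    }

fn = parse_file_name_content(FN_CONTENT)
print(fn)
```

The sizes here need not agree with the ones in $DATA, for the reason given above: they're only required to be accurate when the attribute is used in a directory index.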

Note we can check all this using istat on the relevant entry:

istat simple.ntfs 0-128-1

One last attribute to look at here, the $DATA attribute, which is next. Going back to the zeroth-mft-entry, it starts at offset 256 (from the start of the entry, not the start of the entire MFT). How did we get this? It’s not a magic value: we computed it. Remember, the previous attribute ($FILE_NAME) started at offset 152 and was 104 bytes long. 152 + 104 = 256.

dd if=zeroth-mft-entry of=data_attribute bs=1 skip=256 count=72
00000000  80 00 00 00 48 00 00 00  01 00 40 00 00 00 01 00  |....H.....@.....|
00000010  00 00 00 00 00 00 00 00  12 00 00 00 00 00 00 00  |................|
00000020  40 00 00 00 00 00 00 00  00 30 01 00 00 00 00 00  |@........0......|
00000030  00 04 01 00 00 00 00 00  00 04 01 00 00 00 00 00  |................|
00000040  11 13 04 00 00 00 00 00                           |........|

The first four bytes (value: 128) tell us it’s the $DATA attribute, and the next four tell us its length (72). The next byte tells us that this is a non-resident attribute, so its contents are stored somewhere in cluster(s) on the disk.

Let’s jump ahead to figure out where (that is, which clusters) the data is stored in.

Looking at Table 13.4, bytes 16–23 and 24–31 are used to tell us the starting and ending VCN of the runlist. The VCNs (virtual cluster numbers) are just the sequence of numbers 0..n-1 referring to the n clusters in a file in order. This is in contrast to the LCNs (logical cluster numbers), which are the actual cluster numbers (on disk) that correspond to the VCN clusters. (on board)

Why does the non-resident header have this marker? Because for very long, fragmented files, you might not be able to fit the runlist into a single MFT entry. NTFS then needs to split the attribute across several MFT entries; the VCN range is how you figure out “where you are” in the attribute.

Then there’s an offset to the runlist (from the start of this attribute) at bytes 32–33. The runlist is in the following format:

First there’s a single byte that describes the length and offset of the next run; then there’s a variable number of bytes describing the length of the run, and a variable number of bytes describing the offset to the run.

In more detail: The first byte is split into two nibbles (4-bit values). The low-order bits tell you the number of bytes in the run length; the high-order bits tell you the number of bytes in the offset to the run.

These values stored in the length and offset are in units of clusters, not bytes or sectors, and the offset bytes are signed.

And while the length is what you’d expect, the offset is relative to the previous offset in the runlist (the first offset is relative to the start of the filesystem, that is, cluster 0). Let’s look at some data:

11 13 04

So, the first byte contains 0x11, which in binary is

0001 0001

This run is described by a single-byte length and a single-byte offset. The length comes first: 0x13, so it’s 19 clusters long. The offset comes next: 0x04, so it starts 4 clusters past the start of the file system. Which is what we expect, again as shown by istat.

istat simple.ntfs 0-128-1

...
Type: $DATA (128-1)   Name: N/A   Non-Resident   size: 66560  init_size: 66560
4 5 6 7 8 9 10 11 
12 13 14 15 16 17 18 19 
20 0 0 
...

Notice there are 19 values there, starting at 4, and the last two are zero. Why? The allocated size of this attribute’s content (bytes 40–47: as an int: 77824) is exactly 19 4 KB clusters. But the actual size (bytes 48–55: 66560) fits in 16.25 clusters, so only 17 are needed. So the last two clusters are allocated but not used. istat represents this by showing their numbers as zeroes.
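The non-resident header fields and the runlist decoding described above can be sketched together in Python. Treat this as an illustration under a few assumptions: decode_runlist is a name invented here, a 0x00 header byte is taken to end the runlist, and sparse runs (a zero-length offset field) are not handled.

```python
import struct

# The 72-byte $DATA attribute extracted above.
DATA_ATTR = bytes.fromhex(
    "80 00 00 00 48 00 00 00 01 00 40 00 00 00 01 00"
    "00 00 00 00 00 00 00 00 12 00 00 00 00 00 00 00"
    "40 00 00 00 00 00 00 00 00 30 01 00 00 00 00 00"
    "00 04 01 00 00 00 00 00 00 04 01 00 00 00 00 00"
    "11 13 04 00 00 00 00 00"
)
CLUSTER_SIZE = 4096

def decode_runlist(data):
    """Return a list of (starting LCN, length in clusters) runs."""
    runs, pos, lcn = [], 0, 0
    while pos < len(data) and data[pos] != 0:   # assumed: 0x00 ends the list
        length_size = data[pos] & 0x0F          # low nibble
        offset_size = data[pos] >> 4            # high nibble
        pos += 1
        length = int.from_bytes(data[pos:pos + length_size], "little")
        pos += length_size
        # The offset is signed and relative to the previous run's start.
        offset = int.from_bytes(data[pos:pos + offset_size], "little", signed=True)
        pos += offset_size
        lcn += offset
        runs.append((lcn, length))
    return runs

start_vcn, end_vcn = struct.unpack_from("<QQ", DATA_ATTR, 16)
(runlist_offset,) = struct.unpack_from("<H", DATA_ATTR, 32)
allocated_size, actual_size = struct.unpack_from("<QQ", DATA_ATTR, 40)
runs = decode_runlist(DATA_ATTR[runlist_offset:])
clusters_used = -(-actual_size // CLUSTER_SIZE)  # ceiling division

print(runs, (start_vcn, end_vcn), allocated_size, actual_size, clusters_used)
```

With a single run of 19 clusters starting at LCN 4, the VCNs 0..18 map to LCNs 4..22, and only the first 17 of them hold file content.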

If you continue parsing this entry, you’ll see that there’s one more attribute of type 0xb0, which is a $BITMAP attribute. In $MFT it tracks which MFT entries are allocated (indexes use the same attribute type to track their index records), but we’ll not concern ourselves with it.

How do you know you are done parsing attributes? The MFT entry is of fixed size, so the attributes generally don’t fill it. The header does record how many bytes of the entry are in use (bytes 24–27; here 0x198), but the usual check is to look for the hex value ffff ffff where you’d expect an attribute to begin; that indicates you have finished. In this MFT entry it’s at offset 0x190 from the start of the entry.
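Putting the pieces together, walking the attributes of an entry is a loop that hops by each attribute's length until it hits the end marker. The sketch below uses a reduced stand-in for the zeroth entry rather than the full dump: it's a zeroed 1 KB buffer with only the fields the loop reads (first attribute offset, each attribute's type and length, and the end marker) filled in at the offsets seen above. iter_attributes is a made-up name.

```python
import struct

END_MARKER = 0xFFFFFFFF

def iter_attributes(entry):
    """Yield (offset, type, length) for each attribute in an MFT entry."""
    (offset,) = struct.unpack_from("<H", entry, 20)  # first attribute offset
    while offset + 8 <= len(entry):
        type_id, length = struct.unpack_from("<II", entry, offset)
        if type_id == END_MARKER:
            break
        if length == 0:
            # Guard against looping forever on a corrupt entry.
            raise ValueError("zero-length attribute at offset %d" % offset)
        yield offset, type_id, length
        offset += length

# Stand-in for the zeroth MFT entry: only the fields read above are set.
entry = bytearray(1024)
entry[0:4] = b"FILE"
struct.pack_into("<H", entry, 20, 56)
for attr_offset, attr_type, attr_length in [
    (56, 0x10, 96),     # $STANDARD_INFORMATION
    (152, 0x30, 104),   # $FILE_NAME
    (256, 0x80, 72),    # $DATA
    (328, 0xB0, 72),    # $BITMAP
]:
    struct.pack_into("<II", entry, attr_offset, attr_type, attr_length)
struct.pack_into("<I", entry, 0x190, END_MARKER)

attributes = list(iter_attributes(bytes(entry)))
for offset, type_id, length in attributes:
    print(offset, hex(type_id), length)
```

Run against the real zeroth-mft-entry file, the same loop should visit the same four offsets, since it only looks at the first eight bytes of each attribute.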