14: More on NTFS

Announcements

Should I accept A08 without penalty until Friday?

NTFS: A re-overview

One core concept in NTFS is simple: Everything is (or is stored in) a file. Regular files, directories, the structures that control the filesystem’s layout on disk (like the FAT from FAT16) – all are either files or stored in files. There’s no separate plane of existence for filesystem metadata (like inodes in a UNIX-y filesystem or the FAT + directory entries in FAT16). Certain special files have special names (for example, what we think of as the “boot sector” in FAT is a file called $Boot in NTFS), but they’re all just considered files by the file system.

The next core concept in NTFS is that the above is a bit of a cheat. Most (but not all) of what we think of as filesystem metadata is stored in one particular data structure: the Master File Table (MFT), which is stored as a file (named $MFT of course) but contents-wise is analogous to the FATs and dirents, as we’ll see. We’re going to spend a lot of time today and next week talking about the MFT and how it relates to the files stored on disk.

The final high-level thing you need to know about NTFS is that it breaks a disk up into allocatable units called clusters, just like FAT. Just like FAT, clusters are sized as a power-of-two-multiple of the underlying disk sector size. Unlike FAT, though, cluster 0 starts at the beginning of the partition, so there’s none of the “first cluster is cluster number 2” nonsense to contend with.

Boot sector super fast review

The boot sector (a file named $Boot) is always in the zeroth cluster. We can parse it to find the byte offset to the start of the $MFT, which we compute from three $Boot fields: bytes_per_sector × sectors_per_cluster × MFT_cluster_start. We can also parse the entry_size, which is almost always 1,024 bytes.
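As a rough sketch of that computation: the code below assumes the conventional NTFS boot sector field positions (bytes per sector at bytes 11–12, sectors per cluster at byte 13, the MFT's starting cluster at bytes 48–55, and the entry size at byte 64). Those offsets are not given in these notes, so check them against the boot sector table in the book. The function name mft_location and the sample values (512-byte sectors, 8 sectors per cluster) are made up for illustration.

```python
import struct

def mft_location(boot_sector):
    """Return (byte offset of the $MFT, MFT entry size in bytes)."""
    # Assumed field positions: bytes 11-12, byte 13, bytes 48-55, byte 64.
    bytes_per_sector, sectors_per_cluster = struct.unpack_from("<HB", boot_sector, 11)
    (mft_cluster_start,) = struct.unpack_from("<Q", boot_sector, 48)
    (raw_entry_size,) = struct.unpack_from("<b", boot_sector, 64)
    bytes_per_cluster = bytes_per_sector * sectors_per_cluster
    # A negative value n means the entry is 2**(-n) bytes long;
    # a positive value is a count of clusters.
    if raw_entry_size < 0:
        entry_size = 2 ** -raw_entry_size
    else:
        entry_size = raw_entry_size * bytes_per_cluster
    return mft_cluster_start * bytes_per_cluster, entry_size

# A synthetic boot sector holding only the fields read above (hypothetical values).
boot = bytearray(512)
struct.pack_into("<HB", boot, 11, 512, 8)   # 512-byte sectors, 8 per cluster
struct.pack_into("<Q", boot, 48, 4)         # MFT starts at cluster 4
struct.pack_into("<b", boot, 64, -10)       # 2**10 = 1024-byte entries
print(mft_location(boot))
```

With 4 kB clusters and the MFT starting at cluster 4, this gives a byte offset of 16384 (16K) and an entry size of 1024.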

What’s in the MFT?

The MFT is just a sequence of entries. The first 16 are reserved by MS for filesystem metadata information, but in practice it’s the first 24 that are reserved. Table 11.1 shows the contents of the reserved entries.

Entry 0 is an entry for the MFT itself. We need this, because although the boot sector tells us where the MFT starts, it (the MFT) might run across multiple clusters. This entry tells us where to find the rest of the MFT!

Entry 3 is the $Volume information; entry 6 is for the $Bitmap (similar to the FAT, but it only tracks allocation, not runs); entry 7 is for the $Boot sector, and so on.

We’re going to look at one shortly. But before we do, let’s talk a little about the general structure of an MFT entry.

It starts with an MFT entry header (42 bytes), described in detail in Table 13.1. Then there’s a sequence of attribute (header, content) pairs, with (usually) some unused space at the end of the entry.

The attribute header (16 bytes) identifies the attribute type, size, and name, among other things.

The attribute contents can have any format and any size: one perhaps obvious use is to store the contents of the file corresponding to the entry. Small attribute contents can fit in the MFT entry itself. (One systems consequence: files that are small enough – roughly, under 700 B – don’t automatically waste tons of space as they do in FAT, since they don’t occupy a whole cluster.) These are called resident attributes generally, whether they store file contents or other attribute content.

Larger attributes (again, might be files, might be other things) might not fit in the entry; these are called non-resident. Non-resident attributes are stored in clusters. The clusters are identified by runlists. Runlists are just lists of runs of contiguous clusters that hold the file. See Figure 11.6.

We know that the MFT is at 16K into this volume; let’s use some UNIXy tools to pull out the first entry so that we can see offsets from zero in this entry:

dd if=simple.ntfs of=zeroth-mft-entry bs=1024 count=1 skip=16

…and take a look at it.

00000000  46 49 4c 45 30 00 03 00  00 00 00 00 00 00 00 00  |FILE0...........|
00000010  01 00 01 00 38 00 01 00  98 01 00 00 00 04 00 00  |....8...........|
00000020  00 00 00 00 00 00 00 00  04 00 00 00 00 00 00 00  |................|
00000030  03 00 00 00 00 00 00 00  10 00 00 00 60 00 00 00  |............`...|
00000040  00 00 18 00 00 00 00 00  48 00 00 00 18 00 00 00  |........H.......|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000070  06 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  30 00 00 00 68 00 00 00  |........0...h...|
000000a0  00 00 18 00 00 00 02 00  4a 00 00 00 18 00 01 00  |........J.......|
000000b0  05 00 00 00 00 00 05 00  80 64 e1 e7 8b a1 d2 01  |.........d......|
000000c0  80 64 e1 e7 8b a1 d2 01  80 64 e1 e7 8b a1 d2 01  |.d.......d......|
000000d0  80 64 e1 e7 8b a1 d2 01  00 70 00 00 00 00 00 00  |.d.......p......|
000000e0  00 6c 00 00 00 00 00 00  06 00 00 00 00 00 00 00  |.l..............|
000000f0  04 03 24 00 4d 00 46 00  54 00 00 00 00 00 00 00  |..$.M.F.T.......|
00000100  80 00 00 00 48 00 00 00  01 00 40 00 00 00 01 00  |....H.....@.....|
00000110  00 00 00 00 00 00 00 00  12 00 00 00 00 00 00 00  |................|
00000120  40 00 00 00 00 00 00 00  00 30 01 00 00 00 00 00  |@........0......|
00000130  00 04 01 00 00 00 00 00  00 04 01 00 00 00 00 00  |................|
00000140  11 13 04 00 00 00 00 00  b0 00 00 00 48 00 00 00  |............H...|
00000150  01 00 40 00 00 00 03 00  00 00 00 00 00 00 00 00  |..@.............|
00000160  00 00 00 00 00 00 00 00  40 00 00 00 00 00 00 00  |........@.......|
00000170  00 10 00 00 00 00 00 00  10 00 00 00 00 00 00 00  |................|
00000180  10 00 00 00 00 00 00 00  11 01 02 00 00 00 00 00  |................|
00000190  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 03 00  |................|
00000200  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000003f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 03 00  |................|
00000400

See Table 13.1 on page 353.

It starts with the four byte sequence corresponding to ASCII "FILE" (or "BAAD" if there’s an error on the disk in this entry). Let’s find the first attribute – remember, all entries in the MFT consist of an MFT header, followed by attributes (which themselves consist of headers and contents). This first attribute’s offset (from the start of the entry) is stored in bytes 20–21. Here, its value is 56 (0x38).
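Here is a minimal sketch of that first step in Python, using the first 32 bytes of the dump above. The function name parse_entry_header is just an illustrative choice; it reads only the signature and the first-attribute offset described here.

```python
import struct

# First 32 bytes of the zeroth MFT entry, copied from the dump above.
HEADER = bytes.fromhex(
    "46 49 4c 45 30 00 03 00 00 00 00 00 00 00 00 00 "
    "01 00 01 00 38 00 01 00 98 01 00 00 00 04 00 00"
)

def parse_entry_header(entry):
    """Return (signature, offset of the first attribute) for one MFT entry."""
    signature = bytes(entry[0:4])
    if signature not in (b"FILE", b"BAAD"):
        raise ValueError(f"not an MFT entry: {signature!r}")
    # Bytes 20-21: little-endian offset to the first attribute.
    (first_attr_offset,) = struct.unpack_from("<H", entry, 20)
    return signature, first_attr_offset

print(parse_entry_header(HEADER))  # (b'FILE', 56)
```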

Attributes: headers and contents

Let’s skip to the attribute, which starts with a header in a standard format. See Tables 13.2–13.4. The first 16 bytes are the same in resident and non-resident attribute headers; after that they diverge.

The header starts with a four-byte type tag. Here, it’s 16, which is a $STANDARD_INFORMATION header. This and $FILE_NAME (48) are two attributes that nearly every entry will have.

The next four bytes (4–7) tell us the attribute’s total length. Here it’s 96, so the next attribute starts 96 bytes past the start of the current one, at entry offset 56 + 96 = 152.

Byte 8 (offset from 56, remember, so byte 64 in the dump) tells us if the attribute’s content is non-resident. Here it’s zero, so this attribute’s content is resident – that is, it’s embedded in the MFT entry.

(Discussion names for standard attributes? E.g., as ADS?)

Let’s jump ahead to bytes 16–19 (again: offset from 56, so byte 72, or 0x48, in the dump) to get the size of the content (72), and to bytes 20–21 (byte 76, or 0x4c) to get the offset (24) of this attribute’s content from the start of the attribute. (Sanity check: 72 + 24 = 96, which is the total size of the attribute and also the offset to the next attribute.)

So you get the idea; Tables 13.5 and 13.6 tell you how to parse the $STANDARD_INFORMATION attribute’s contents.
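The header fields we just walked through can be pulled out in a few lines. This is only a sketch for resident attributes, using the 24 header bytes that start at offset 56 in the dump; the function name parse_resident_header and the dictionary keys are illustrative choices.

```python
import struct

# The 24-byte header of the $STANDARD_INFORMATION attribute (entry offset 56).
ATTR = bytes.fromhex(
    "10 00 00 00 60 00 00 00 00 00 18 00 00 00 00 00 "
    "48 00 00 00 18 00 00 00"
)

def parse_resident_header(attr):
    """Parse the fields of a resident attribute header discussed above."""
    # Bytes 0-3: type, bytes 4-7: total length, byte 8: non-resident flag.
    type_id, length, non_resident = struct.unpack_from("<IIB", attr, 0)
    if non_resident:
        raise ValueError("attribute is non-resident")
    # Bytes 16-19: content size, bytes 20-21: offset to the content.
    content_size, content_offset = struct.unpack_from("<IH", attr, 16)
    return {
        "type": type_id,
        "length": length,
        "content_size": content_size,
        "content_offset": content_offset,
    }

info = parse_resident_header(ATTR)
print(info)
# Same sanity check as in the text: header plus content fills the attribute.
print(info["content_offset"] + info["content_size"] == info["length"])
```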

Let’s look at the next attribute for this entry (remember, we’re still looking at the zeroth entry in the MFT). We find it by skipping past the previous one. The previous one started at offset 56 and had a length of 96, so we need to skip ahead to byte 56 + 96 = 152 (0x98) of the entry.

This one has type 48 ($FILE_NAME). Let’s dig into this attribute. We see it’s 104 (0x68) bytes long, so let’s pull it out using dd so we can see offsets from 0 (or just do it in Hex Fiend):

dd if=zeroth-mft-entry of=file_name_attribute bs=1 skip=152 count=104

00000000  30 00 00 00 68 00 00 00  00 00 18 00 00 00 02 00  |0...h...........|
00000010  4a 00 00 00 18 00 01 00  05 00 00 00 00 00 05 00  |J...............|
00000020  80 64 e1 e7 8b a1 d2 01  80 64 e1 e7 8b a1 d2 01  |.d.......d......|
*
00000040  00 70 00 00 00 00 00 00  00 6c 00 00 00 00 00 00  |.p.......l......|
00000050  06 00 00 00 00 00 00 00  04 03 24 00 4d 00 46 00  |..........$.M.F.|
00000060  54 00 00 00 00 00 00 00                           |T.......|

The first four bytes tell us the type (48); the next four are the length (104 bytes). This attribute is also resident. Let’s skip ahead to bytes 20–21, which are the offset (from the start of the attribute) to the content. Here, it’s 24, so let’s go there.

The $FILE_NAME attribute is described in Table 13.7 on page 362. I’m going to use Hex Fiend to do the extraction here, but it’s the same as using dd previously, or as slicing a sequence of bytes or seek()ing in Python.

dd if=file_name_attribute of=file_name_attribute_content bs=1 skip=24
00000000  05 00 00 00 00 00 05 00  80 64 e1 e7 8b a1 d2 01  |.........d......|
00000010  80 64 e1 e7 8b a1 d2 01  80 64 e1 e7 8b a1 d2 01  |.d.......d......|
00000020  80 64 e1 e7 8b a1 d2 01  00 70 00 00 00 00 00 00  |.d.......p......|
00000030  00 6c 00 00 00 00 00 00  06 00 00 00 00 00 00 00  |.l..............|
00000040  04 03 24 00 4d 00 46 00  54 00 00 00 00 00 00 00  |..$.M.F.T.......|

The first eight bytes are the file reference of the parent directory. File references are composed of two parts: the file number and the sequence number. The file number we’ve already seen: it’s the index into the MFT to get to this entry, starting from zero. The sequence number is incremented each time the MFT entry is allocated for use. The two numbers are combined into a 64-bit file reference, with the 48-bit file number in the low-order bits and the 16-bit sequence number in the high-order bits. Note that like all values, it’s stored little endian, so the on-disk byte layout is:

FN FN FN FN FN FN SS SS

where FN are file number bytes and SS are sequence number bytes.

So in this $FILE_NAME, the file number is 5. 5 is one of the reserved slots in the MFT, which if we look up in Table 11.1, we see is the root directory.
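A small sketch of splitting a file reference in Python: read all eight bytes as one little-endian integer, then mask and shift. The function name split_file_reference is an illustrative choice.

```python
def split_file_reference(raw):
    """Split an 8-byte file reference into (file number, sequence number)."""
    ref = int.from_bytes(raw, "little")
    file_number = ref & 0xFFFFFFFFFFFF  # low 48 bits
    sequence_number = ref >> 48         # high 16 bits
    return file_number, sequence_number

# The parent directory reference from the $FILE_NAME content above.
print(split_file_reference(bytes.fromhex("05 00 00 00 00 00 05 00")))  # (5, 5)
```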

The next sequence of 8 bytes is the file creation time. NTFS stores file-related times as the number of blocks of 100 ns since January 1st, 1601 UTC.

Here, the value is 8064E1E7 8BA1D201 so we can get the number in Python using:

import struct
timestamp = struct.unpack('<Q', bytes.fromhex('8064E1E7 8BA1D201'))[0]

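To convert that to a time, we convert to a UNIX-style timestamp (the number of seconds since January 1, 1970, the UNIX epoch). The UNIX epoch itself falls 116444736000000000 100 ns blocks after January 1st, 1601 UTC, so we subtract that and divide by the 10,000,000 blocks in each second. So to convert, we can write: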

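import datetime
def as_datetime(windows_timestamp):
    # 116444736000000000 is the UNIX epoch expressed in 100 ns blocks since 1601.
    unix_seconds = (windows_timestamp - 116444736000000000) / 10000000
    # Pass tz explicitly; without it, fromtimestamp() returns local time, not UTC.
    return datetime.datetime.fromtimestamp(unix_seconds, tz=datetime.timezone.utc)
print(str(as_datetime(timestamp)))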

Bytes 16–23 are the modification time; 24–31 are the MFT modification time; 32–39 are the last access time.

Bytes 40–47 are the allocated file size and bytes 48–55 are the actual size, but these are not required to be accurate unless this attribute is used in a directory index.

Bytes 56–59 are flags, just like in FAT (see Table 13.6).

Byte 64 is the length of the filename (4 characters). Byte 65 is the namespace (see Table 13.8; here it’s the Windows/DOS namespace). Bytes 66 onward are the name, in this case in UTF-16 (LE): $MFT. This is the $MFT entry, just like we expected for entry 0.
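Putting the size, flag, and name fields together, here is a sketch that parses the $FILE_NAME content extracted above. The function name parse_file_name and the dictionary keys are illustrative; it skips the parent reference and the timestamps.

```python
import struct

# The $FILE_NAME content, copied from the dump above.
CONTENT = bytes.fromhex(
    "05 00 00 00 00 00 05 00 80 64 e1 e7 8b a1 d2 01 "
    "80 64 e1 e7 8b a1 d2 01 80 64 e1 e7 8b a1 d2 01 "
    "80 64 e1 e7 8b a1 d2 01 00 70 00 00 00 00 00 00 "
    "00 6c 00 00 00 00 00 00 06 00 00 00 00 00 00 00 "
    "04 03 24 00 4d 00 46 00 54 00 00 00 00 00 00 00"
)

def parse_file_name(content):
    """Parse the sizes, flags, namespace, and name from $FILE_NAME content."""
    allocated_size, actual_size = struct.unpack_from("<QQ", content, 40)
    (flags,) = struct.unpack_from("<I", content, 56)
    name_length, namespace = struct.unpack_from("<BB", content, 64)
    # The length counts UTF-16 code units, so the name is 2 bytes per unit.
    name = content[66:66 + 2 * name_length].decode("utf-16-le")
    return {
        "allocated_size": allocated_size,
        "actual_size": actual_size,
        "flags": flags,
        "namespace": namespace,
        "name": name,
    }

print(parse_file_name(CONTENT))
```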

Note we can check all this using istat on the relevant entry:

istat simple.ntfs 0-128-1

One last attribute to look at here, the $DATA attribute, which is next. Going back to the zeroth-mft-entry, it starts at offset 256 (from the start of the entry, not the start of the entire MFT). How did we get this? It’s not a magic value: we computed it. Remember, the previous attribute ($FILE_NAME) started at offset 152 and was 104 bytes long. 152 + 104 = 256.

dd if=zeroth-mft-entry of=data_attribute bs=1 skip=256 count=72
00000000  80 00 00 00 48 00 00 00  01 00 40 00 00 00 01 00  |....H.....@.....|
00000010  00 00 00 00 00 00 00 00  12 00 00 00 00 00 00 00  |................|
00000020  40 00 00 00 00 00 00 00  00 30 01 00 00 00 00 00  |@........0......|
00000030  00 04 01 00 00 00 00 00  00 04 01 00 00 00 00 00  |................|
00000040  11 13 04 00 00 00 00 00                           |........|

The first four bytes (value: 128) tell us it’s the $DATA attribute, and the next four tell us its length (72). The next byte tells us that this is a non-resident attribute, so its contents are stored somewhere in cluster(s) on the disk.

Let’s jump ahead to figure out where (that is, which clusters) the data is stored in.

Looking at Table 13.4, bytes 16–23 and 24–31 are used to tell us the starting and ending VCN of the runlist. The VCN is just a sequence of numbers 0..n-1 referring to the n clusters in a file in order. This is in contrast to the LCN, which is a list of the actual cluster numbers (on disk) that correspond to the VCN clusters. (on board)

Why does the non-resident header have this marker? Because for very long, fragmented files, you might not be able to fit the runlist into a single MFT entry. NTFS then needs to split the attribute across several MFT entries; the VCN range is how you figure out which part of the attribute each entry describes.

Then there’s an offset to the runlist (from the start of this attribute) at bytes 32-33. The runlist is in the following format:

First there’s a single byte that describes the length and offset of the next run; then there’s a variable number of bytes describing the length of the run, and a variable number of bytes describing the offset to the run.

In more detail: The first byte is split into two nibbles (4-bit values). The low-order bits tell you the number of bytes in the run length; the high-order bits tell you the number of bytes in the offset to the run.

These values stored in the length and offset are in units of clusters, not bytes or sectors, and the offset bytes are signed.

And, while the length is what you’d expect, the offset is relative to the previous offset in the runlist (the first offset is relative to the start of the filesystem, that is, cluster 0). Let’s look at some data:

11 13 04

So, the first byte is 0x11, which in binary is

0001 0001

This run is described by a one-byte length and a one-byte offset. The length comes first: 0x13, so it’s 19 clusters long. The offset comes next: 0x04, so it starts 4 clusters past the start of the file system. Which is what we expect, again as shown by istat.

istat simple.ntfs 0-128-1

...
Type: $DATA (128-1)   Name: N/A   Non-Resident   size: 66560  init_size: 66560
4 5 6 7 8 9 10 11 
12 13 14 15 16 17 18 19 
20 0 0 
...

Notice there are 19 values there: 17 consecutive cluster numbers starting at 4, then two zeroes. Why the zeroes? The allocated size of this attribute’s content (bytes 40–47: as an int: 77824) is exactly 19 4kB clusters. But the actual size (bytes 48–55: 66560) needs only 16.25 clusters, so it occupies 17. So the last two clusters are allocated but not used. istat represents this by showing their numbers as zeroes.
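The non-resident header fields used in this walkthrough can be read the same way as the resident ones. The sketch below uses the $DATA attribute bytes from the dump and assumes the 4 kB cluster size mentioned above; the function name parse_non_resident_header is illustrative.

```python
import struct

CLUSTER_SIZE = 4096  # 4 kB clusters, as on this volume

# The 72-byte $DATA attribute of MFT entry 0, copied from the dump above.
DATA_ATTR = bytes.fromhex(
    "80 00 00 00 48 00 00 00 01 00 40 00 00 00 01 00 "
    "00 00 00 00 00 00 00 00 12 00 00 00 00 00 00 00 "
    "40 00 00 00 00 00 00 00 00 30 01 00 00 00 00 00 "
    "00 04 01 00 00 00 00 00 00 04 01 00 00 00 00 00 "
    "11 13 04 00 00 00 00 00"
)

def parse_non_resident_header(attr):
    """Return the VCN range, runlist bytes, and sizes of a non-resident attribute."""
    start_vcn, end_vcn = struct.unpack_from("<QQ", attr, 16)
    (runlist_offset,) = struct.unpack_from("<H", attr, 32)
    allocated_size, actual_size = struct.unpack_from("<QQ", attr, 40)
    return start_vcn, end_vcn, attr[runlist_offset:], allocated_size, actual_size

start_vcn, end_vcn, runlist, allocated, actual = parse_non_resident_header(DATA_ATTR)
clusters_allocated = allocated // CLUSTER_SIZE
clusters_used = -(-actual // CLUSTER_SIZE)  # ceiling division
print(start_vcn, end_vcn, runlist[:3].hex(), clusters_allocated, clusters_used)
```

The VCN range 0..18 agrees with the 19 allocated clusters, of which 17 actually hold data.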

If you continue parsing this entry, you’ll see that there’s one more attribute of type 0xb0, which is a $BITMAP attribute. In the $MFT, this is used to track which MFT entries are allocated, but we’ll not concern ourselves with it.

How do you know you are done parsing attributes? The MFT entry is of fixed size, and doesn’t include a “total length” in its header. Instead, you look for the hex value ffff ffff where you’d expect an attribute to begin; that indicates you have finished. In this MFT entry they’re at offset 0x190 from the start of the entry.
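A sketch of that loop follows. Rather than retyping the whole dump, it rebuilds a blank 1,024-byte entry containing only the fields the loop reads (the first-attribute offset, each attribute's type and length, and the end marker), using the values we found above. The function name iter_attributes is an illustrative choice.

```python
import struct

END_MARKER = 0xFFFFFFFF

def iter_attributes(entry):
    """Yield (offset, type, length) for each attribute in one MFT entry."""
    (offset,) = struct.unpack_from("<H", entry, 20)  # first attribute offset
    while offset + 4 <= len(entry):
        (attr_type,) = struct.unpack_from("<I", entry, offset)
        if attr_type == END_MARKER:
            return
        (length,) = struct.unpack_from("<I", entry, offset + 4)
        if length == 0:
            raise ValueError(f"zero-length attribute at offset {offset}")
        yield offset, attr_type, length
        offset += length  # the next attribute starts right after this one

# Rebuild only the fields the loop reads, with the values from the walkthrough.
entry = bytearray(1024)
entry[0:4] = b"FILE"
struct.pack_into("<H", entry, 20, 56)
for off, attr_type, length in [(56, 16, 96), (152, 48, 104), (256, 128, 72), (328, 176, 72)]:
    struct.pack_into("<II", entry, off, attr_type, length)
struct.pack_into("<I", entry, 400, END_MARKER)  # 400 == 0x190

for off, attr_type, length in iter_attributes(entry):
    print(off, attr_type, length)
```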

Runlist practice

So, the details of MFT entries are somewhat tedious, but mostly straightforward. My observation last year was that for many people, parsing the runlists was the hardest thing. So let’s do another, more complicated example now.

Suppose we had a file that was written to disk in a fragmented way. (This is adapted from Figure 11.6, if you’re curious.)

Let’s say that the first part of the file was written to clusters 48–52 on disk; then the next part ended up at clusters 980–981, and the last part in clusters 56–59. This file is broken up into three separate “runs” on disk. What would the runlist look like once it was decoded?

48, 49, 50, 51, 52, 980, 981, 56, 57, 58, 59

(By the way, this sequence is the “LCN” we talked about earlier; the VCN is just 0, 1, … 10.)

What would the runlist look like?

11 05 30 21 02 a4 03 21 04 64 fc

The first byte would describe the first run’s length and offset. In hex: 0x11; in binary: 0b00010001, which tells us that the first run’s offset value is 1 byte long, and its length is one byte. What are those bytes?

05 30: Note the length comes first. It’s five (clusters long). The offset is 0x30, or 48. For the first run, that’s 48 clusters from the start of the file system; for any later run it would be relative to the previous run’s start. So: 48, 49, 50, 51, 52

The next run’s first byte tells us its length and offset sizes. In hex: 0x21, in binary: 0b00100001, which tells us this second run’s offset value is two bytes long, and its length is one byte. What are those bytes?

02 a4 03: Again, length (2) first. Then the offset. The bytes a4 03 are a signed little-endian value, 0x03a4, which is 932 in decimal. It’s saying that this run starts 932 clusters from the previous run’s offset. So 48 + 932 = 980, for two clusters: 980, 981.

The last run’s first byte tells us its length and offset sizes. Just like the previous run, a one-byte length and a two-byte offset.

04 64 fc: Length of four then an offset. The bytes 64 fc are the signed little-endian value 0xfc64, which is -924. From the previous offset’s start of 980: 980 + (-924) = 56. So: 56, 57, 58, 59.

That’s how you parse runlists.
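The whole procedure can be sketched as one short function. Two things here go beyond what these notes state and should be treated as assumptions: the loop stops at a 00 header byte (the usual end-of-runlist marker, visible after 11 13 04 in the $DATA attribute dump), and it ignores sparse runs, whose offset field is empty. The function name decode_runlist is an illustrative choice.

```python
def decode_runlist(raw):
    """Decode an NTFS runlist into (starting LCN, length in clusters) pairs."""
    runs = []
    position = 0
    lcn = 0  # the first offset is relative to cluster 0
    # Assumption: a 00 header byte marks the end of the runlist.
    while position < len(raw) and raw[position] != 0:
        header = raw[position]
        length_size = header & 0x0F  # low nibble: bytes in the run length
        offset_size = header >> 4    # high nibble: bytes in the run offset
        position += 1
        length = int.from_bytes(raw[position:position + length_size], "little")
        position += length_size
        # The offset is signed and relative to the previous run's start.
        offset = int.from_bytes(raw[position:position + offset_size], "little", signed=True)
        position += offset_size
        lcn += offset
        runs.append((lcn, length))
    return runs

runs = decode_runlist(bytes.fromhex("11 05 30 21 02 a4 03 21 04 64 fc"))
print(runs)  # [(48, 5), (980, 2), (56, 4)]
clusters = [start + i for start, length in runs for i in range(length)]
print(clusters)  # [48, 49, 50, 51, 52, 980, 981, 56, 57, 58, 59]
```

Here int.from_bytes with signed=True handles offsets of any byte length directly, which sidesteps the padding step that struct.unpack requires.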

One practical complication: struct.unpack has no format code for a three-byte signed value; like integer types in most languages, its formats are a power-of-two bytes long (1, 2, 4, 8). If a signed value is not “big enough” to fill one of these lengths, you need to pad it before you call struct.unpack on it.

If the leading bit of the MSB is 0, then you pad with zero bytes. For example, the little-endian three-byte value 01 00 00 would be padded to four bytes as follows: 01 00 00 00. Or the value 00 00 0A would be padded to 00 00 0A 00.

If the leading bit of the MSB is 1, then you pad with FF bytes (this preserves the underlying signed twos-complement value). So for example, the value 20 10 80 would be padded to 20 10 80 FF.
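A sketch of that padding rule: for simplicity it always pads out to eight bytes and unpacks with the "<q" format, rather than picking the next power-of-two size, which gives the same value. The function name unpack_signed_le is an illustrative choice.

```python
import struct

def unpack_signed_le(raw):
    """Sign-extend a 1- to 8-byte little-endian value to 8 bytes and unpack it."""
    if not 1 <= len(raw) <= 8:
        raise ValueError("expected between 1 and 8 bytes")
    # The last byte is the MSB in little-endian order; its top bit is the sign.
    pad = b"\xff" if raw[-1] & 0x80 else b"\x00"
    return struct.unpack("<q", bytes(raw) + pad * (8 - len(raw)))[0]

for example in ("01 00 00", "00 00 0A", "20 10 80"):
    raw = bytes.fromhex(example)
    # int.from_bytes(..., signed=True) is a built-in alternative that needs no padding.
    print(example, unpack_signed_le(raw), int.from_bytes(raw, "little", signed=True))
```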

Fixup arrays

Probably what you’re thinking right now is that this all makes perfect sense and you wish it could just be made arbitrarily more complicated. Great news! Let’s talk about fixup arrays.

Fixups are a form of integrity checking: they help NTFS detect corrupted data and thus presumably corrupted sectors. They store a small signature at the end of each sector’s worth of data in each MFT entry; if the signature is wrong, the sector is likely corrupted.

How do they work? Pick an arbitrary 2-byte value (for example 00 01). For each sector in the MFT entry, replace its last two bytes with this arbitrary value. (On board.)

But what about the data there, aren’t we obliterating it? Yes! But don’t we need it? Yes! So can we still recover it? Yes! The data isn’t just overwritten; first it’s stored in the “fixup array”, which is just a byte sequence you can find by parsing the MFT entry header. The header has an offset-to-fixup-array value and a number-of-fixup-entries value. So you can go look up these values and know what was overwritten. (on board)

So when you’re actually parsing NTFS MFT entries, you’ll need to (pretty early on, before you get to the attributes), parse the fixup array and modify the bytes object that you’re working on to have the correct values in place.
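Here is a sketch of that step. It assumes 512-byte sectors and the usual header positions for the two fixup fields (offset to the array at bytes 4–5, number of entries at bytes 6–7, which in the dump above are 0x30 and 3); the array itself holds the 2-byte signature followed by one 2-byte original value per sector. The function name apply_fixups and the sample bytes aa bb / cc dd are made up for illustration.

```python
import struct

def apply_fixups(entry, sector_size=512):
    """Return a copy of an MFT entry with the original sector-end bytes restored."""
    fixed = bytearray(entry)
    # Assumed header layout: bytes 4-5 offset to fixup array, bytes 6-7 entry count.
    fixup_offset, fixup_count = struct.unpack_from("<HH", fixed, 4)
    signature = bytes(fixed[fixup_offset:fixup_offset + 2])
    # Array entry 0 is the signature; entries 1..n-1 are the saved originals.
    for i in range(1, fixup_count):
        sector_end = i * sector_size
        if bytes(fixed[sector_end - 2:sector_end]) != signature:
            raise ValueError(f"fixup signature mismatch in sector {i - 1}")
        original = fixed[fixup_offset + 2 * i:fixup_offset + 2 * i + 2]
        fixed[sector_end - 2:sector_end] = original
    return bytes(fixed)

# A synthetic two-sector entry: signature 03 00, saved originals aa bb and cc dd.
entry = bytearray(1024)
struct.pack_into("<HH", entry, 4, 48, 3)
entry[48:54] = bytes.fromhex("03 00 aa bb cc dd")
entry[510:512] = b"\x03\x00"
entry[1022:1024] = b"\x03\x00"

fixed = apply_fixups(entry)
print(fixed[510:512].hex(), fixed[1022:1024].hex())  # aabb ccdd
```

Working on a copy keeps the raw on-disk bytes available in case you want to report a mismatch rather than stop at the first one.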