12: Demonstration: Parsing FAT


Midterms are graded so you can see your grade up through HW5 and the midterm. Regrade requests: Start with Gradescope and the person who graded your answer. If you’re dissatisfied with the result, email me separately. I’ll close the Gradescope regrade interface at the end of this week and update the Moodle gradebook after all requests are resolved.

The W-ithdraw deadline for undergraduates is tomorrow. Moodle does the math; you decide whether to stay or not.

Snowpocalypse: If the University is closed, hooray! If the snow is bad but the University only delays opening to 10, I may still cancel class. If I do, I’ll post to Piazza by around 8 so people coming in from off campus will know they needn’t make the trip.

Long File Names

Let’s look back at the root directory entries in adams.dd, and see how to parse an LFN.

dd if=adams.dd bs=512 skip=41 count=32| hexdump -C
00000000  41 44 41 4d 53 20 20 20  20 20 20 28 00 00 00 00  |ADAMS      (....|
00000010  00 00 00 00 00 00 e1 62  1e 39 00 00 00 00 00 00  |.......b.9......|
00000020  41 69 00 6d 00 61 00 67  00 65 00 0f 00 71 73 00  |Ai.m.a.g.e...qs.|
00000030  00 00 ff ff ff ff ff ff  ff ff 00 00 ff ff ff ff  |................|
00000040  49 4d 41 47 45 53 20 20  20 20 20 10 00 00 c4 79  |IMAGES     ....y|
00000050  e1 38 1c 39 00 00 4f 84  1c 39 03 00 00 00 00 00  |.8.9..O..9......|
00000060  41 44 00 65 00 73 00 69  00 67 00 0f 00 d4 6e 00  |AD.e.s.i.g....n.|
00000070  73 00 2e 00 64 00 6f 00  63 00 00 00 00 00 ff ff  |s...d.o.c.......|
00000080  44 45 53 49 47 4e 53 20  44 4f 43 20 00 00 4e 81  |DESIGNS DOC ..N.|
00000090  1c 39 1c 39 00 00 4e 81  1c 39 2d 07 00 72 27 00  |.9.9..N..9-..r'.|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

LFN entries, marked by the value 0x0f in the attributes byte (offset 0xb, i.e. byte 11, into the dirent), come immediately before the file they refer to; they’re a backward-compatible way to give longer names than the old DOS 8.3 system. Older OSes ignore directory entries with this otherwise incompatible set of flags, which is why Microsoft designed it this way.

LFN entries repurpose most of the fields in the directory entry to store the characters of the filename.

0x0 (1B): sequence number, starting at 1, not 0; last one is ORed with 0x40

0x1 (10 B): 5 UCS-2 characters (UCS-2 is a subset of UTF-16 that can only handle codepoints in the basic multilingual plane)

0xB (1B): attributes

0xC (1B): unused

0xD (1B): checksum

0xE (12B): 6 UCS-2 characters

0x1A (2B): reserved (always zero)

0x1C (4B): 2 UCS-2 characters
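To make the layout concrete, here is a minimal sketch that pulls the fields out of one 32-byte LFN entry. The function name parse_lfn_entry is made up for illustration. It reads the name characters from bytes 1–10, 14–25, and 28–31, and decodes UCS-2 as little-endian UTF-16 (the two agree for the basic multilingual plane). The sample bytes are the LFN entry at offset 0x60 in the hexdump above.

```python
def parse_lfn_entry(dirent):
    """Split one 32-byte LFN directory entry into its fields."""
    seq_byte, attributes, checksum = dirent[0], dirent[11], dirent[13]
    if attributes != 0x0F:
        raise ValueError('not an LFN entry')
    # The 13 characters live in three separate runs of bytes.
    raw = dirent[1:11] + dirent[14:26] + dirent[28:32]
    chars = raw.decode('utf-16-le')
    # A name shorter than 13 characters ends with 0x0000, then 0xFFFF padding.
    chars = chars.split('\x00')[0]
    return {
        'sequence': seq_byte & 0x3F,        # low six bits: position, starting at 1
        'is_last': bool(seq_byte & 0x40),   # 0x40 flags the final piece of the name
        'checksum': checksum,
        'chars': chars,
    }

# LFN entry preceding DESIGNS.DOC, copied from the hexdump of the root directory.
entry = bytes.fromhex(
    '41 44 00 65 00 73 00 69 00 67 00 0f 00 d4 6e 00'
    '73 00 2e 00 64 00 6f 00 63 00 00 00 00 00 ff ff'
)
print(parse_lfn_entry(entry))
```

The result should carry the name Designs.doc, which matches what fls reports for that file.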

Each LFN entry can hold 13 characters. If a filename needs more than 13 characters, more than one LFN entry will precede the file’s regular directory entry. They are stored in reverse order, last first, and the sequence number of the last piece (the one stored first on disk) is ORed with 0x40. For example, a file named “File with very long filename.ext” (32 characters) needs 3 LFN entries, whose sequence numbers and contents would be:

0x43 "me.ext"
0x02 "y long filena"
0x01 "File with ver"

then a regular directory entry for the file.
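The ordering rule can be sketched in a few lines: cut the name into 13-character pieces, number them from 1, OR 0x40 into the last number, and emit them in reverse. The helper name lfn_pieces is hypothetical, and this ignores the padding and checksum a real entry also needs.

```python
def lfn_pieces(long_name):
    """Return (sequence byte, characters) pairs in on-disk order, last piece first."""
    chunks = [long_name[i:i + 13] for i in range(0, len(long_name), 13)]
    pieces = []
    for number, chunk in enumerate(chunks, start=1):
        # Only the final piece of the name gets the 0x40 flag.
        seq = (number | 0x40) if number == len(chunks) else number
        pieces.append((seq, chunk))
    return pieces[::-1]

for seq, chars in lfn_pieces('File with very long filename.ext'):
    print(f'{seq:#04x}', repr(chars))
```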

I’m going to skip the checksum calculation; again, see Carrier or other resources if you need the details.
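For anyone who does want to check it, here is a minimal sketch of the checksum as it is commonly documented (this is not from the lecture, so treat it as an assumption and verify against Carrier): rotate a running 8-bit sum right by one bit, then add the next byte of the 11-byte short name. The function name lfn_checksum is made up for illustration.

```python
def lfn_checksum(short_name):
    """Checksum over the 11 bytes of an 8.3 name (no dot, space padded)."""
    total = 0
    for byte in short_name:
        # Rotate the 8-bit running sum right by one bit, then add the next byte.
        total = (((total & 1) << 7) + (total >> 1) + byte) & 0xFF
    return total

# Short names from the root directory of adams.dd, padded to 11 bytes.
print(hex(lfn_checksum(b'IMAGES'.ljust(11))))
print(hex(lfn_checksum(b'DESIGNS DOC')))
```

These two values can be compared with byte 0xD of the corresponding LFN entries in the hexdump (71 and d4).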

Recovering a deleted file

Earlier, we found an entry corresponding to a deleted file:

dd if=adams.dd bs=512 skip=75 count=2| hexdump -C
2+0 records in
2+0 records out
1024 bytes transferred in 0.000023 secs (44278013 bytes/sec)
00000000  2e 20 20 20 20 20 20 20  20 20 20 10 00 00 4e 5c  |.          ...N\|
00000010  a1 38 a1 38 00 00 4e 5c  a1 38 03 00 00 00 00 00  |.8.8..N\.8......|
00000020  2e 2e 20 20 20 20 20 20  20 20 20 10 00 00 4e 5c  |..         ...N\|
00000030  a1 38 a1 38 00 00 4e 5c  a1 38 00 00 00 00 00 00  |.8.8..N\.8......|
00000040  e5 4d 47 5f 33 30 32 37  4a 50 47 20 00 00 c4 79  |.MG_3027JPG ...y|
00000050  e1 38 e1 38 00 00 c4 79  e1 38 04 00 8c a0 1c 00  |.8.8...y.8......|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

The file was named “?MG_3027.JPG” – notice that we lose the first character of a filename if it’s deleted. It used to be stored at cluster 4.
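As a rough sketch, the same facts can be read out of the 32 bytes programmatically. The function name describe_dirent is hypothetical, and substituting '_' for the lost first character simply mirrors what Sleuthkit displays.

```python
import struct

def describe_dirent(dirent):
    """Return (deleted?, name, first cluster, size) for a short directory entry."""
    deleted = dirent[0] == 0xE5
    # Byte 0 was overwritten by the deletion marker, so show '_' in its place.
    name_bytes = (b'_' + dirent[1:8]) if deleted else dirent[0:8]
    name = name_bytes.decode('ascii').rstrip()
    extension = dirent[8:11].decode('ascii').rstrip()
    # Bytes 26-27: first cluster; bytes 28-31: file size (both little-endian).
    first_cluster, size = struct.unpack('<HL', dirent[26:32])
    return deleted, name + '.' + extension, first_cluster, size

# The deleted entry at offset 0x40 in the hexdump above.
entry = bytes.fromhex(
    'e5 4d 47 5f 33 30 32 37 4a 50 47 20 00 00 c4 79'
    'e1 38 e1 38 00 00 c4 79 e1 38 04 00 8c a0 1c 00'
)
print(describe_dirent(entry))
```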

Sleuthkit also sees this file (-r shows everything, recursively):

fls -r adams.dd
r/r 3:  ADAMS       (Volume Label Entry)
d/d 5:  images
+ r/r * 549:    _MG_3027.JPG
r/r 7:  Designs.doc
v/v 163171: $MBR
v/v 163172: $FAT1
v/v 163173: $FAT2
d/d 163174: $OrphanFiles

or if we want to go entry-by-entry:

fls adams.dd
r/r 3:  ADAMS       (Volume Label Entry)
d/d 5:  images
r/r 7:  Designs.doc
v/v 163171: $MBR
v/v 163172: $FAT1
v/v 163173: $FAT2
d/d 163174: $OrphanFiles

fls adams.dd 5
r/r * 549:  _MG_3027.JPG

Where does this metadata address of 549 come from? It’s clearly not a cluster number or whatnot, right?

Many filesystems have a concept of “inodes”, which are unique metadata addresses that files and directories share. Not FAT. So instead TSK generates unique metadata addresses for FAT. The root directory entry is given the value 2. Each sector of the disk, starting at the beginning of the data area, could hypothetically contain 16 entries, so we number them starting from 3. This means that, say, the 512 entries in the root directory are numbered 3–514. Then there’s gonna be gaps, since most sectors don’t actually hold directory entries.

Recall that our cluster area started at sector 73, and the directory entries we extracted for this “IMAGES” directory were at sector 75. If sectors 73 and 74 were full of directory entries, there’d be 32 of them in total (16 per sector). And our deleted file is the 3rd entry in the next sector.

514 + 32 + 3 = 549, the metadata address. Boom.
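The arithmetic can be written out as a small function. This is only a sketch of the numbering scheme as described in these notes, not TSK’s actual code; the name metadata_address and the root_dir_sector default of 41 (taken from the dd skip=41 command used on adams.dd) are assumptions for illustration.

```python
ENTRIES_PER_SECTOR = 512 // 32   # 16 directory entries fit in a 512-byte sector
FIRST_ADDRESS = 3                # 2 is the root directory itself

def metadata_address(sector, index, root_dir_sector=41):
    """index is the 0-based position of the entry within its sector."""
    return FIRST_ADDRESS + (sector - root_dir_sector) * ENTRIES_PER_SECTOR + index

print(metadata_address(41, 0))   # volume label ADAMS, first root entry
print(metadata_address(41, 2))   # IMAGES, third root entry
print(metadata_address(41, 4))   # DESIGNS.DOC, fifth root entry
print(metadata_address(75, 2))   # deleted JPG, third entry in sector 75
```

The four results can be compared against the addresses in the fls output (3, 5, 7, and 549).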

OK, how long was this file before it was deleted? The last four bytes of the directory entry, 8c a0 1c 00, read as a little-endian integer (0x001ca08c), show it was 1,876,108 bytes long, which would have required 1833 1KB clusters to store.

Interestingly, that’s exactly how many clusters are currently marked as unallocated between its old starting cluster (4) and the next cluster allocated on the disk (1837). I wonder if those bytes look like a JPEG? Remember, the cluster area starts with cluster 2 at sector 73, and each 1KB cluster is two sectors, so cluster 4 starts two clusters (four sectors) later, at sector 73 + 4 = 77.
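Here is the same arithmetic as a short sketch, which also produces the skip and count values for the dd command. The constant names are made up; the values (512-byte sectors, 1KB clusters, cluster 2 at sector 73) are the ones used in these notes for adams.dd.

```python
SECTOR_SIZE = 512
CLUSTER_SIZE = 1024          # 1KB clusters, i.e. 2 sectors each
CLUSTER_AREA_START = 73      # sector where cluster 2 begins
FIRST_CLUSTER = 4            # from bytes 26-27 of the deleted entry

# File size: last four bytes of the directory entry, little-endian.
size = int.from_bytes(bytes.fromhex('8c a0 1c 00'), 'little')

# Round up: a partly used final cluster still occupies a whole cluster.
clusters_needed = (size + CLUSTER_SIZE - 1) // CLUSTER_SIZE

sectors_per_cluster = CLUSTER_SIZE // SECTOR_SIZE
first_sector = CLUSTER_AREA_START + (FIRST_CLUSTER - 2) * sectors_per_cluster
sector_count = clusters_needed * sectors_per_cluster

print(size, clusters_needed, first_sector, sector_count)
```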

dd if=adams.dd of=IMG_3027.JPG bs=512 skip=77 count=3666
hexdump -Cv IMG_3027.JPG|less

Those headers look familiar to you at all?

This is (almost) what icat (remember that?) from the second lecture does. icat is a little smarter. For example, it will truncate the file to the file size listed in the directory entry.
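The truncation step is easy to sketch. The function below is a hypothetical helper, not icat’s actual implementation (icat does more than this); it copies a contiguous run of sectors the way dd does and optionally trims the result to the size recorded in the directory entry.

```python
def carve(image, first_sector, sector_count, file_size=None, sector_size=512):
    """Copy a contiguous run of sectors; trim to file_size if one is given."""
    start = first_sector * sector_size
    data = image[start:start + sector_count * sector_size]
    return data if file_size is None else data[:file_size]

# Tiny synthetic "image": one sector of A, then two sectors of B.
image = b'A' * 512 + b'B' * 1024
print(len(carve(image, 1, 2)))                  # whole sectors, like dd
print(len(carve(image, 1, 2, file_size=600)))   # trimmed to the recorded size
```

For the JPEG above, dd copies 3666 × 512 = 1,876,992 bytes, so trimming to the recorded 1,876,108 bytes drops 884 bytes of slack from the final cluster.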

Building and then parsing a filesystem

(This will be helpful to you when doing the next assignment, which I’ll be putting up shortly.)

This is all being done on an Ubuntu virtual machine, using Vagrant to manage it. We are just creating the filesystem; no MBR.

# create a new empty file
dd if=/dev/zero of=fat.dd bs=1M count=10

# view it
hexdump -C fat.dd

# create a FAT filesystem
mkfs.fat fat.dd 

# view it
hexdump -C fat.dd

# view it in sleuthkit
fsstat fat.dd 
fls fat.dd

Parsing it

Can we get some essentials out of this ourselves? In particular, the cluster size, the first FAT, root directory area, and cluster area? (Code at end of notes; compare with fsstat output.)

Two asides

First, you can work on sequences of bytes or directly on a file-like object:

with open('fat.dd', 'rb') as f:
    data = f.read()

x = data[i:j]

# is equivalent to

with open('fat.dd', 'rb') as f:
    f.seek(i)            # move to byte offset i
    x = f.read(j - i)    # a slice [i:j] holds j - i bytes

The former is maybe ergonomically easier but does require that you load the entire file into memory, which is not always feasible.

Second, indexing into a sequence is different from slicing a sequence:

bytes_sequence[i]  # returns the i-th element of bytes_sequence (an int)
bytes_sequence[i:i+1]  # returns a sequence consisting of the i-th element of bytes_sequence

This distinction is particularly important when passing arguments to struct.unpack, as it expects a bytes sequence as its second argument, not a single value.
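A quick demonstration of what goes wrong, using a few made-up bytes rather than a real boot sector:

```python
import struct

data = bytes([0xEB, 0x3C, 0x90, 0x04])   # arbitrary sample bytes

print(type(data[3]))      # indexing gives an int
print(type(data[3:4]))    # slicing gives a bytes object of length 1

# Works: the second argument is a bytes object.
print(struct.unpack('<B', data[3:4])[0])

# Fails: the second argument is an int, not a bytes-like object.
try:
    struct.unpack('<B', data[3])
except TypeError as error:
    print('TypeError:', error)
```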

On with the show

# mount it ; sync it so changes show up immediately in our disk image
mkdir mnt
sudo mount -o sync fat.dd mnt/

# view it
fls fat.dd
hexdump -C fat.dd

# add a file
nano hello.txt
sudo cp hello.txt mnt/

# view it
fls fat.dd
hexdump -C fat.dd

Can we parse this directory entry? (Code at end of notes; compare with fsstat and fls output.)

# make a 2-cluster file
dd if=/dev/urandom of=random.dat bs=2048 count=2
sudo cp random.dat mnt/

Can we parse this directory entry? (Code at end of notes; compare with fsstat and fls output.)

Code from class follows:

import struct

def as_le_unsigned(b):
    table = {1: 'B', 2: 'H', 4: 'L', 8: 'Q'}
    return struct.unpack('<' + table[len(b)], b)[0]

def get_sector_size(fs_bytes):
    return as_le_unsigned(fs_bytes[11:13])

def get_cluster_size(fs_bytes):
    return as_le_unsigned(fs_bytes[13:14]) * get_sector_size(fs_bytes)

def get_reserved_area_size(fs_bytes):
    return as_le_unsigned(fs_bytes[14:16]) * get_sector_size(fs_bytes)

def get_fat_size(fs_bytes):
    return as_le_unsigned(fs_bytes[22:24]) * get_sector_size(fs_bytes)

def get_fat0(fs_bytes):
    start = get_reserved_area_size(fs_bytes)
    length = get_fat_size(fs_bytes)
    return fs_bytes[start:start + length]

def get_number_of_fats(fs_bytes):
    return as_le_unsigned(fs_bytes[16:17])

def get_max_root_directory_entries(fs_bytes):
    return as_le_unsigned(fs_bytes[17:19])

def get_root_directory_area(fs_bytes):
    start = get_reserved_area_size(fs_bytes) + get_number_of_fats(fs_bytes) * get_fat_size(fs_bytes)
    length = get_max_root_directory_entries(fs_bytes) * 32  # 32 bytes / entry
    return fs_bytes[start:start + length]

def get_sector_count(fs_bytes):
    return max(as_le_unsigned(fs_bytes[19:21]), as_le_unsigned(fs_bytes[32:36]))

def get_cluster_area(fs_bytes):
    fs_size = get_sector_count(fs_bytes) * get_sector_size(fs_bytes)

    start = get_reserved_area_size(fs_bytes) + get_number_of_fats(fs_bytes) * get_fat_size(fs_bytes) \
            + get_max_root_directory_entries(fs_bytes) * 32

    number_of_clusters = (fs_size - start) // get_cluster_size(fs_bytes)
    length = number_of_clusters * get_cluster_size(fs_bytes)

    return fs_bytes[start:start + length]

def get_filename(dirent):
    # strip the space padding from both the 8-byte name and the 3-byte extension
    return dirent[0:8].decode('ascii').strip() + '.' + dirent[8:11].decode('ascii').strip()

def get_first_cluster(dirent):
    return as_le_unsigned(dirent[26:28])

def get_filesize(dirent):
    return as_le_unsigned(dirent[28:32])

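def get_cluster_numbers(first_cluster, fat_bytes, cluster_size):
    # follows the chain through a FAT with 16-bit (2-byte) entries;
    # cluster_size is not needed here but is kept so existing calls still work
    result = [first_cluster]
    offset = 2 * first_cluster
    next_cluster = as_le_unsigned(fat_bytes[offset:offset + 2])
    while next_cluster < as_le_unsigned(b'\xf8\xff'):  # 0xfff8 and above mark end of chain
        result.append(next_cluster)
        offset = 2 * next_cluster
        next_cluster = as_le_unsigned(fat_bytes[offset:offset + 2])
    return result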

def main():
    with open('fat.dd', 'rb') as f:
        data = f.read()
    print('sector size:', get_sector_size(data))
    print('cluster size:', get_cluster_size(data))
    print('reserved area size:', get_reserved_area_size(data))
    print('FAT size:', get_fat_size(data))
    print('number of FATs:', get_number_of_fats(data))
    print('max root entries:', get_max_root_directory_entries(data))
    print('sector count:', get_sector_count(data))

    root_directory_entries = get_root_directory_area(data)

    dirent = root_directory_entries[32 * 3: 32 * 4]
    print('filename:', get_filename(dirent))
    print('first cluster:', get_first_cluster(dirent))
    print('file size:', get_filesize(dirent))
    print('cluster numbers:', get_cluster_numbers(get_first_cluster(dirent), get_fat0(data), get_cluster_size(data)))

if __name__ == '__main__':
    main()