04: Carving, strings, and UTF-8

Announcements

Carving

"Carving" is a generic term for extracting and assembling the "interesting" bits from a larger collection of bits. At a high level, a substantial portion of the technical content we're going to cover will be successive applications of carving techniques. The definition of "interesting" will grow more complex, as will the process of extraction and assembly, but the underlying idea will remain the same.

These techniques are very similar to parsing, and we can in fact use the same general family of techniques. In this course, you'll hand-write most tools yourself, but know that if you take a compilers course, some of what we're going to do will have a lot of overlap with lexers and parsers.

Carving text (ASCII) from files

Suppose we don't know anything about a file or filetype. In the long term, we might take the time to reverse engineer the file type from existing data, from source code we might have access to, or, in the worst case, by reverse engineering a binary.

But in the short term, we might try to extract meaningful data from the file. The simplest form of data we might try to pull out is text. How can we do this? A naive algorithm is to read bytes sequentially, outputting each run of bytes that represents valid ASCII text. We might set a minimum length on the runs to help ensure we're getting valid values, and not just random values.

(Show the strings state machine.)
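A minimal sketch of the naive algorithm in Python follows. The function name `ascii_strings` and the `min_len` parameter are made up for illustration, and "printable" is assumed to mean the 95 bytes from 0x20 (space) through 0x7E (tilde). Real versions of strings differ on details such as tabs and terminators.

```python
def ascii_strings(data: bytes, min_len: int = 4):
    """Yield each run of at least min_len printable ASCII bytes in data."""
    run = bytearray()
    for b in data:
        if 0x20 <= b <= 0x7E:
            # State "in a run": keep accumulating printable bytes.
            run.append(b)
        else:
            # A non-printable byte ends the run; emit it if long enough.
            if len(run) >= min_len:
                yield run.decode("ascii")
            run.clear()
    # The input may end while we are still inside a run.
    if len(run) >= min_len:
        yield run.decode("ascii")


if __name__ == "__main__":
    sample = b"Hello Marc\nabc\n\nGoodbye Marc"
    for s in ascii_strings(sample):
        print(s)
```

The two states are "inside a run" and "outside a run"; the only decision is what to do when a non-printable byte (or the end of input) is reached.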

The strings utility is installed on most UNIX machines. By default, it extracts from a given input the ASCII strings consisting of four or more printable characters. The version of strings installed determines some of the fiddly behavior, like whether it only considers strings that are NUL or newline terminated.

If you run strings on a text file, then you just get the lines of that file that contain four or more characters:

# -e is to turn on escape characters ('\n') in my version of `echo`
> echo -e "Hello Marc\nabc\n\nGoodbye Marc" > test.txt  
# `cat` sends its input to standard output
> cat test.txt 
Hello Marc
abc

Goodbye Marc
> strings test.txt
Hello Marc
Goodbye Marc

The GNU version of strings (installed on my Mac as gstrings) allows you to search files not just for ASCII, but for general Unicode in UTF-8, -16, or -32; for the last two, it lets you specify little- or big-endian encodings. These are specified using the -e option. (More on this topic in a bit.)

> man gstrings
> gstrings -h
Usage: gstrings [option(s)] [file(s)]
 Display printable strings in [file(s)] (stdin by default)
 The options are:
  -a - --all                Scan the entire file, not just the data section [default]
  -d --data                 Only scan the data sections in the file
  -f --print-file-name      Print the name of the file before each string
  -n --bytes=[number]       Locate & print any NUL-terminated sequence of at
  -<number>                   least [number] characters (default 4).
  -t --radix={o,d,x}        Print the location of the string in base 8, 10 or 16
  -w --include-all-whitespace Include all whitespace as valid string characters
  -o                        An alias for --radix=o
  -T --target=<BFDNAME>     Specify the binary file format
  -e --encoding={s,S,b,l,B,L} Select character size and endianness:
                            s = 7-bit, S = 8-bit, {b,l} = 16-bit, {B,L} = 32-bit
  -s --output-separator=<string> String used to separate strings in output.
  @<file>                   Read options from <file>
  -h --help                 Display this information
  -v -V --version           Print the program's version number
gstrings: supported targets: mach-o-x86-64 mach-o-i386 mach-o-le mach-o-be mach-o-fat pef pef-xlib sym plugin srec symbolsrec verilog tekhex binary ihex
Report bugs to <http://www.sourceware.org/bugzilla/>

Data validity

How do we know strings extracted in this way are meaningful? In general, we don't, though there might be a line of inductive reasoning that could apply.

A supporting line of evidence might be argued probabilistically. What are the odds that n sequential bytes are ASCII? p^n, where p is the probability they're each ASCII. If you assume that each character is generated IID, p = 95/256. (That's kind of a weird assumption, unless your data source is a random number generator, though.)

That's for a single sequence; what if you want to ask whether, by chance alone, we would find a run of n ASCII bytes somewhere among m bytes? There are m - n + 1 such runs, one starting at each possible offset. "At least one run" is the opposite of "no runs". A single run fails to be all ASCII with probability 1 - p^n, so if the runs were independent, the probability of no runs would be

(1 - p^n) ^ (m - n + 1), so consider 1 - that quantity. Or should you? These runs are definitely not IID, since each successive run contains a fraction (n-1)/n of the previous run, in order! Ultimately, we can play probability games for whatever question you want to ask. It's important, then, to note that things like strings are best used to help generate hypotheses or to reconstruct unknown file formats, and not generally to (attempt to) form inductive hypotheses.
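To get a feel for the numbers, here is a small sketch of the arithmetic. It assumes uniformly random bytes, counts m - n + 1 windows of length n in m bytes, takes 1 - p^n as the chance that a given window is not all ASCII, and treats overlapping windows as independent. That independence is the questionable assumption, so the second function is only a rough estimate. The function names are made up for illustration.

```python
P_PRINTABLE = 95 / 256  # 95 printable ASCII characters out of 256 byte values


def prob_window_is_ascii(n: int) -> float:
    """Probability that n specific random bytes are all printable ASCII."""
    return P_PRINTABLE ** n


def prob_any_run_naive(n: int, m: int) -> float:
    """Rough chance of at least one all-ASCII window of length n in m bytes.

    Pretends the m - n + 1 overlapping windows are independent (they are not).
    """
    windows = m - n + 1
    if windows <= 0:
        return 0.0
    return 1 - (1 - prob_window_is_ascii(n)) ** windows


if __name__ == "__main__":
    print(prob_window_is_ascii(4))
    print(prob_any_run_naive(4, 1024))
```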

Carving Unicode / UTF-8

Not all text is ASCII. Recall that Unicode maps characters to code points (numbers), and the various UTF schemes map code points to particular byte encodings. Code points are just numbers, and are usually written as U+XXXX, where XXXX is the value in hex (padded to at least four digits). So, for example, the code point for £ is 163 (in decimal), 0xa3 (in hex), or U+00A3 written as a Unicode code point.

Let's consider how UTF-8 encodes Unicode code points into bytes.

7-bit characters are encoded in a single byte.
11-bit characters are encoded in two bytes.
16-bit characters are encoded in three bytes.
21-bit characters are encoded in four bytes.

How?

Let's look at the bit pattern for each range of code points.

0x00 -- 0x7F: 0XXXXXXX (where each X is a bit from the code point)
0x80 -- 0x7FF: 110XXXXX 10XXXXXX
0x800 -- 0xFFFF: 1110XXXX 10XXXXXX 10XXXXXX
0x10000 -- 0x10FFFF: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX

Take the code point value and write it in binary using the minimum number of bits it will fit in (7, 11, 16, or 21), left-padding with zero bits. Then pack the bits left-to-right into the patterns above, replacing the Xs with the bits of the code point.
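The packing procedure can be sketched by hand in Python. The function name `utf8_encode` is made up for illustration, and this sketch does not reject surrogate code points (U+D800 through U+DFFF), which a real encoder would refuse. It is meant only to show the bit shuffling, and in practice you would use str.encode.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 by packing bits into the patterns."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:
        # 0XXXXXXX
        return bytes([cp])
    if cp <= 0x7FF:
        # 110XXXXX 10XXXXXX
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


if __name__ == "__main__":
    # Compare the hand-rolled result with Python's built-in encoder.
    for ch in ["A", "\u00a3", "\u20ac", "\U0001F600"]:
        print(hex(ord(ch)), utf8_encode(ord(ch)), ch.encode("utf-8"))
```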

Benefits:

backward-compatible with ASCII
single and multi-byte characters are distinct
the first byte indicates the length of the byte sequence, which entails the "prefix property": no valid sequence is a prefix of any other
self-sync: single bytes, leading bytes, continuation bytes are distinct, so we can seek to the (potential) next character in either direction trivially
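The self-synchronizing property can be illustrated with a short sketch: the high bits of any byte tell you whether it is a single byte, a leading byte, or a continuation byte, so from an arbitrary offset you can skip forward to the next (potential) character start. The names `byte_kind` and `next_boundary` are made up for illustration, and the sketch does not check that the sequence as a whole is valid.

```python
def byte_kind(b: int) -> str:
    """Classify one byte by its high bits, following the UTF-8 patterns."""
    if (b & 0b10000000) == 0b00000000:
        return "single"        # 0XXXXXXX
    if (b & 0b11000000) == 0b10000000:
        return "continuation"  # 10XXXXXX
    if (b & 0b11100000) == 0b11000000:
        return "lead2"         # 110XXXXX
    if (b & 0b11110000) == 0b11100000:
        return "lead3"         # 1110XXXX
    if (b & 0b11111000) == 0b11110000:
        return "lead4"         # 11110XXX
    return "invalid"           # 11111XXX never appears in UTF-8


def next_boundary(data: bytes, i: int) -> int:
    """Return the smallest index >= i that is not a continuation byte."""
    while i < len(data) and byte_kind(data[i]) == "continuation":
        i += 1
    return i


if __name__ == "__main__":
    data = "a\u00a3\u20ac".encode("utf-8")
    print([byte_kind(b) for b in data])
    print(next_boundary(data, 2))  # offset 2 is in the middle of the pound sign
```

Seeking backward works the same way, stepping the index down instead of up until a non-continuation byte is found.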

Example:

Consider the £ symbol; we can ask Python for its Unicode value (its code point) using ord, and we find it's 163. That's 0xa3 (binary: 10100011, eight bits), so it's going to be encoded in 11 bits as a two-byte UTF-8 value. What value?

bin(163) = 0b10100011; we are going to pack it into 11 bits, so adding padding zeros on the left: 000 1010 0011

Let's pack the bits appropriately:

110XXXXX 10XXXXXX

11000010 10100011

Now let's ask Python what these two values are:

0b11000010 = 194

0b10100011 = 163

OK. What's the actual encoding of £, assuming Python does the right thing?

'£'.encode('utf-8')
# -> b'\xc2\xa3'

If you know how to read raw byte encodings, that looks good. Breaking it down:

u = '£'.encode('utf-8')
u[0], u[1]
# ->  (194, 163)

Note that there are a few valid-seeming UTF-8 byte sequences that are, per the standard, not considered valid or are not valid all the time. See Wikipedia https://en.wikipedia.org/wiki/Unicode under the discussion about low and high surrogates, noncharacters, reserved and private-use codes, etc.

So, how can you tell if text is valid UTF-8? Try to decode it! If you can, it probably is. Now, whether it's semantically meaningful or not is a different story.
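"Try to decode it" is a one-liner in Python, since bytes.decode raises UnicodeDecodeError on invalid input. The helper name `is_valid_utf8` is made up for illustration. The sample inputs are chosen to show a few of the valid-seeming sequences that Python's strict decoder rejects.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    samples = {
        "pound sign": b"\xc2\xa3",
        "lone continuation byte": b"\xa3",
        "truncated two-byte sequence": b"\xc2",
        "overlong encoding of '/'": b"\xc0\xaf",
        "encoded surrogate U+D800": b"\xed\xa0\x80",
    }
    for label, raw in samples.items():
        print(label, is_valid_utf8(raw))
```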

Other notes

Also note that there is more than one encoding type! For example, UTF-16, which you might want to know about. Windows, in particular, uses UTF-16, which works on pairs of bytes. And when you have pairs of bytes (or generally, more than one byte), what else do you have? Endianness.

To help with endianness, UTF-16 has a "byte-order mark." The BOM is U+FEFF, the "zero width no-break space" character, which is inserted at the very front of UTF-16 data. If missing, the standard says to assume big-endian, though many Windows applications and APIs assume little-endian. You can also specify the encoding as UTF-16BE / UTF-16LE and omit the BOM (for example, if you are using HTTP, you can set the encoding in the Content-Type header, typically something like Content-Type: text/html; charset=utf-8).
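A short sketch of what the BOM looks like on the wire, using Python's codecs module. The explicit utf-16-le and utf-16-be codecs write no BOM, while the generic utf-16 decoder uses a leading BOM, if present, to pick the byte order. The variable names are just for illustration.

```python
import codecs

text = "\u00a3"  # the pound sign, U+00A3

# Explicit byte orders: no BOM is written, only the two bytes of the code unit.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# Prepend the BOM by hand; U+FEFF serialized in each byte order.
little_with_bom = codecs.BOM_UTF16_LE + little
big_with_bom = codecs.BOM_UTF16_BE + big

if __name__ == "__main__":
    print(little, big)
    print(little_with_bom, big_with_bom)
    # The generic decoder reads the BOM, picks the byte order, and drops the BOM.
    print(little_with_bom.decode("utf-16"), big_with_bom.decode("utf-16"))
```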

GNU strings can be told which encoding to use.

Using external knowledge for encoding

Usually you know something about the data you're searching through. If so, use that knowledge. Or just try them all, I suppose.

The following is the programmer folk wisdom for determining unknown input's format:

If it has a BOM, use the BOM to identify it as UTF-8, UTF-16 or UTF-32. This is the only code path that will ever identify a file as UTF-16 or UTF-32, interestingly enough.
If it contains only ASCII characters, identify it as ASCII.
If it contains ASCII plus 0xA0 (nbsp), supposedly a semi-common case, identify it as ISO Latin 1 (aka ISO-8859-1), similar to but different from Windows-1252.
If none of the above match, run a UTF-8 prober. This is usually overkill, because UTF-8 has a very distinctive and strict structure, and you can usually identify it simply by asking: "Does this contain at least one UTF-8 character (if not, exit early) and if so, does it parse as valid UTF-8?" If true, it's generally reasonable to assume UTF-8.
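The folk-wisdom steps above can be sketched roughly as follows. The function name `guess_encoding` and the returned labels are made up for illustration, and this is a heuristic, not a real detector. One detail the sketch has to handle: the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, so the UTF-32 check must come first.

```python
import codecs

# Order matters: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Guess a text encoding using the BOM / ASCII / Latin-1 / UTF-8 steps."""
    # Step 1: a BOM, if present, decides the matter.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 2: nothing but 7-bit bytes.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Step 3: ASCII plus 0xA0 (nbsp).
    if all(b < 0x80 or b == 0xA0 for b in data):
        return "iso-8859-1"
    # Step 4: there is at least one non-ASCII byte; see if it parses as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


if __name__ == "__main__":
    for raw in [b"hello", b"a\xa0b", "\u00a3".encode("utf-8"), b"\xff\x00"]:
        print(raw, guess_encoding(raw))
```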

In general data recovery scenarios you can use uchardet, which uses a slightly more sophisticated algorithm and can correctly identify many of the pre-Unicode text encodings.

Note that you can't use uchardet or the like on general binary data; only on data you know (or suspect) to consist of text.

Carving files out of files: JPEGs in DOCs

(A sample of things to come!)

JPEG is an encoding; JFIF is the file format.

Interestingly, the start and end of the image stored in JFIF are marked with particular sequences of bytes (0xFFD8 / 0xFFD9). And, if we read about JPEG, we see that within the compressed image data any 0xFF byte has a 0x00 byte appended so that it can't be mistaken for a marker; in other words, to prevent "framing errors." More generally, 0xFF bytes in a ".jpg" (that is, a JFIF containing a JPEG) are used to denote something special, as we'll see later.

But for now, what is the implication of the start and end sequences (0xFFD8 / 0xFFD9)? Well, taken in combination with the fact that you can "embed" JPEGs in most file formats, it means we can carve (and recover) the original JPEG from a file where it's been embedded. We look for each 0xFFD8 followed by a 0xFFD9. We then write all the data between each pair of markers (including the markers) into their own file.
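A naive version of this marker-pair search might look like the sketch below. The function name `carve_jpegs` is made up for illustration, and it returns the carved byte strings rather than writing files. It pairs each start marker with the next end marker, so it will likely cut an image short if that image contains an embedded thumbnail with its own markers.

```python
SOI = b"\xff\xd8"  # start-of-image marker
EOI = b"\xff\xd9"  # end-of-image marker


def carve_jpegs(data: bytes) -> list:
    """Return every span from an SOI marker through the next EOI marker."""
    carved = []
    pos = 0
    while True:
        start = data.find(SOI, pos)
        if start == -1:
            break
        end = data.find(EOI, start + len(SOI))
        if end == -1:
            break  # a start marker with no matching end: nothing to carve
        end += len(EOI)  # include the end marker itself
        carved.append(data[start:end])
        pos = end
    return carved


if __name__ == "__main__":
    # Synthetic container with two fake "images" embedded in other data.
    blob = b"junk" + SOI + b"AAAA" + EOI + b"more" + SOI + b"BB" + EOI + b"tail"
    for i, image in enumerate(carve_jpegs(blob)):
        print(i, image)
```

Each carved span could then be written to its own file, opened in binary mode.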

(Demo)

What else might the 0xFF bytes encode? Remember EXIF? We'll probably get to some of these details next class.