04: Carving, Strings, and Unicode

Carving

“Carving” is a generic term for extracting and assembling the “interesting” bits from a larger collection of bits. The definition of “interesting” will grow more complex, as will the process of extraction and assembly, but the underlying idea will remain the same. Carving is most useful when paired with parsing, which is where we interpret the bits we have carved more carefully, as part of some well-defined specification.

What distinguishes carving from parsing? Informally, carving is where you find candidates for something that could be parsed. For example, all JPG files start with the bytes 0xFFD8 and end with 0xFFD9. If we want to recover deleted JPGs, we can scan through a filesystem looking for 0xFFD8, then find a subsequent 0xFFD9, and say that everything in between is a candidate for being a JPG file. There is more to the JPG format than the start and end markers; so we then parse the candidate file and see whether the other aspects of the JPG standard parse correctly or in error. If parsing fails, then we don’t have a JPG file (or at least not a complete one; we might have half of one). So let’s talk about this more.
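To make the idea concrete, here is a minimal sketch of that candidate-finding step in Python. The function name and the simplistic scan are my own; real carvers have to handle fragmentation, nested markers, and false positives.

```python
# A minimal sketch of JPG carving: scan a byte blob for the 0xFFD8
# start-of-image marker, then take everything up to the next 0xFFD9
# end-of-image marker as a *candidate* file.
def carve_jpg_candidates(data: bytes) -> list[bytes]:
    candidates = []
    start = data.find(b"\xff\xd8")
    while start != -1:
        end = data.find(b"\xff\xd9", start + 2)
        if end == -1:
            break                       # start marker with no end marker
        # Include the two-byte end-of-image marker itself.
        candidates.append(data[start:end + 2])
        start = data.find(b"\xff\xd8", end + 2)
    return candidates

blob = b"junk\xff\xd8fake jpg bytes\xff\xd9more junk"
print(carve_jpg_candidates(blob))       # [b'\xff\xd8fake jpg bytes\xff\xd9']
```

Note that 0xFFD9 can also occur by coincidence inside other data, which is exactly why these are only candidates that still need to be parsed.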

Carving text (ASCII) from files

Suppose we don’t know anything about a file or filetype. In the long term, we might take the time to reverse engineer the file type from existing data, from source code we might have access to, or, worst case, from the executable itself. (You can enroll in CMPSCI 390R if you are interested in learning how to examine assembly language in compiled code.)

But let’s stick to the short term and try to extract meaningful data from the file. The simplest form of data we might try to pull out is text. How can we do this? A naive algorithm is to read bytes sequentially, outputting each run of bytes that represents valid ASCII text. We might set a minimum length on the runs to help ensure we’re getting valid values, and not just random values.
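That naive algorithm might look like this in Python. This is a sketch, not a substitute for the real strings utility; ascii_strings is a made-up name.

```python
# A naive strings-style extractor: walk the bytes, collect runs of
# printable ASCII (0x20 through 0x7E), and emit any run of at least
# `minlen` characters.
def ascii_strings(data: bytes, minlen: int = 4) -> list[str]:
    runs, current = [], []
    for b in data:
        if 0x20 <= b <= 0x7E:            # printable ASCII range
            current.append(chr(b))
        else:                            # run ended; keep it if long enough
            if len(current) >= minlen:
                runs.append("".join(current))
            current = []
    if len(current) >= minlen:           # flush a run at end-of-data
        runs.append("".join(current))
    return runs

print(ascii_strings(b"\x00\x01Hello Marc\x00abc\x00Goodbye Marc\xff"))
# ['Hello Marc', 'Goodbye Marc']
```

The minimum-length threshold is what filters out short accidental runs like "abc" above: three printable bytes in a row are quite likely to occur by chance in binary data.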

The strings utility is installed on most UNIX machines, and by default extracts ASCII strings from a given input consisting of four or more printable characters. The version of strings installed determines some of the fiddly behavior, like whether it only considers strings that are NUL- or newline-terminated.

If you run strings on a text file, then you just get the lines of that file that contain four or more characters:

# -e is to turn on escape characters ('\n') in my version of `echo`
> echo -e "Hello Marc\nabc\n\nGoodbye Marc" > test.txt  
# `cat` sends its input to standard output
> cat test.txt 
Hello Marc
abc

Goodbye Marc
> strings test.txt
Hello Marc
Goodbye Marc

The GNU version of strings, installed on the edlab as /usr/bin/strings, allows you to search files not just for ASCII, but for general Unicode in UTF-8, UTF-16, or UTF-32; for the last two, it lets you specify little- or big-endian encodings. These are specified using the -e option. (More on this topic in a bit.)

> strings --help
Usage: strings [option(s)] [file(s)]
 Display printable strings in [file(s)] (stdin by default)
 The options are:
  -a - --all                Scan the entire file, not just the data section [default]
  -d --data                 Only scan the data sections in the file
  -f --print-file-name      Print the name of the file before each string
  -n --bytes=[number]       Locate & print any NUL-terminated sequence of at
  -<number>                   least [number] characters (default 4).
  -t --radix={o,d,x}        Print the location of the string in base 8, 10 or 16
  -w --include-all-whitespace Include all whitespace as valid string characters
  -o                        An alias for --radix=o
  -T --target=<BFDNAME>     Specify the binary file format
  -e --encoding={s,S,b,l,B,L} Select character size and endianness:
                            s = 7-bit, S = 8-bit, {b,l} = 16-bit, {B,L} = 32-bit
  -s --output-separator=<string> String used to separate strings in output.
  @<file>                   Read options from <file>
  -h --help                 Display this information
  -v -V --version           Print the program's version number

By the way, if you want GNU strings on your own machine, you can install it as part of the binutils package. Command-line arguments differ between versions.

Carving Unicode / UTF-8

Not all text is ASCII. Unicode is an encoding of characters beyond ASCII for languages that need more than a–z. Unicode maps characters to numbers, which the standard refers to as code points. The various “Unicode Transformation Format” (UTF) schemes map code point values to particular byte encodings. Code points are just numbers, and are often written as U+XXXX, where XXXX is the value in hex.

For example, the code point for £ is U+00A3 (decimal 163).

Let’s consider how UTF-8 encodes Unicode code points into bytes.

Characters mapped to code points expressible in:   How?

  7 bits:   0XXXXXXX
  11 bits:  110XXXXX 10XXXXXX
  16 bits:  1110XXXX 10XXXXXX 10XXXXXX
  21 bits:  11110XXX 10XXXXXX 10XXXXXX 10XXXXXX

Method: Take the code point value and write it in binary using the minimum number of bits it will fit in (7, 11, 16, or 21), left-padding with zero bits. Then pack the bits left-to-right into the patterns above, replacing the Xs with the bits of the code point. (Side note: the bit order is fixed by the format; it’s effectively big-endian on any machine you are on.)

Example:

Consider the £ symbol; we can ask Python for its Unicode value (its code point) using ord(), and we find it’s 163. That’s 0xa3 (or 0b10100011, which is eight bits). So it’s going to be encoded in 11 bits as a two-byte UTF-8 value. What value?

>>> ord('£')
163
>>> bin(163)
'0b10100011'

Padding zeros on the left to get to 11 bits: 000 1010 0011

Let’s pack the 11 bits into 2 bytes (16 bits): 110XXXXX 10XXXXXX becomes 11000010 10100011

Now let’s ask Python what these two values are in decimal:

>>> 0b11000010
194
>>> 0b10100011
163

Let’s check our work by asking Python for the actual UTF-8 encoding of £:

>>> '£'.encode('utf-8')
b'\xc2\xa3'

If you know how to read raw byte encodings, that looks good. But if you can’t, you can see the decimal values like so:

>>> u = '£'.encode('utf-8')
>>> u[0], u[1]
(194, 163)

(Why does u[0] give an integer? Because u is a bytes type… We’ll get to that in a bit.)
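The bit-packing we just did by hand can also be written as a short function. Here is a sketch of just the two-byte case (code points that need 8 to 11 bits); utf8_two_byte is a hypothetical helper name.

```python
# A sketch of the two-byte UTF-8 case worked above: split an 11-bit
# code point into a 5-bit chunk and a 6-bit chunk, then drop those
# chunks into the 110XXXXX and 10XXXXXX templates.
def utf8_two_byte(codepoint: int) -> bytes:
    assert 0x80 <= codepoint <= 0x7FF, "only the two-byte range"
    high = 0b11000000 | (codepoint >> 6)           # 110 + top 5 bits
    low = 0b10000000 | (codepoint & 0b00111111)    # 10 + bottom 6 bits
    return bytes([high, low])

print(utf8_two_byte(163))                          # b'\xc2\xa3'
print(utf8_two_byte(163) == '£'.encode('utf-8'))   # True
```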

Note that there are a few valid-seeming UTF-8 byte sequences that are, per the standard, not considered valid or are not valid all the time. See Wikipedia https://en.wikipedia.org/wiki/Unicode under the discussion about low and high surrogates, non-characters, reserved and private-use codes, etc.

So, how can you tell if text is valid UTF-8? Try to decode it! If you can, it probably is. Now, whether it’s semantically meaningful or not is a different story.
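In Python, “try to decode it” is a one-liner wrapped in exception handling. A sketch (is_valid_utf8 is a made-up name):

```python
# Check whether a byte sequence is valid UTF-8 by simply attempting
# to decode it; the codec raises UnicodeDecodeError on invalid input.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b'\xc2\xa3'))      # True: the UTF-8 encoding of £
print(is_valid_utf8(b'\xc2'))          # False: a dangling lead byte
print(is_valid_utf8(b'\xed\xa0\x80'))  # False: would encode the surrogate U+D800,
                                       # which the standard forbids in UTF-8
```

Python’s decoder rejects the invalid-by-fiat sequences mentioned above (surrogates, overlong encodings, and so on), so you get the standard’s rules for free.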

What are the benefits of this approach?

UTF-16

Just a few words about UTF-16.

Like UTF-8, it’s an encoding of Unicode code points (numbers) into bytes. Like UTF-8, it’s variable-width. Unlike UTF-8, each “unit” is 16 bits (two bytes) wide, and the standard does not define a single byte order to use. Like UTF-8, it encodes some of Unicode directly: UTF-8 encodes the first 128 code points (U+0000 to U+007F) directly into one byte; UTF-16 encodes U+0000 to U+D7FF and U+E000 to U+FFFF directly into two bytes, and uses a four-byte encoding (two two-byte units, a surrogate pair) for code points outside those ranges.

Notably, this means the ASCII subset of Unicode is encoded as sequences of alternating zero bytes and ASCII characters. For example, an ASCII encoding of “Marc” would be 4D 61 72 63; in big-endian UTF-16, it would read 00 4D 00 61 00 72 00 63 (and in little-endian UTF-16, 4D 00 61 00 72 00 63 00).

Writing a hacky UTF-16 ASCII string extractor is pretty straightforward as a result.
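For example, here is one such hacky extractor, assuming little-endian UTF-16, so each ASCII character is a printable byte followed by a zero byte. The function name and approach are just one way to do it.

```python
# A hacky extractor for ASCII text stored as little-endian UTF-16:
# look for runs of (printable ASCII byte, 0x00) pairs.
def utf16le_ascii_strings(data: bytes, minlen: int = 4) -> list[str]:
    runs, current, i = [], [], 0
    while i + 1 < len(data):
        if 0x20 <= data[i] <= 0x7E and data[i + 1] == 0x00:
            current.append(chr(data[i]))
            i += 2                       # consume the whole pair
        else:
            if len(current) >= minlen:   # run ended; keep it if long enough
                runs.append("".join(current))
            current = []
            i += 1                       # resynchronize one byte at a time
        # (a real tool would track alignment more carefully)
    if len(current) >= minlen:
        runs.append("".join(current))
    return runs

print(utf16le_ascii_strings(b"M\x00a\x00r\x00c\x00"))  # ['Marc']
```

For big-endian data you would check for the zero byte first instead; real GNU strings handles both via -e b and -e l.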

To help with endianness, UTF-16 has a “byte-order mark.” The BOM is U+FEFF, the “zero-width no-break space” character, which is inserted at the front of UTF-16 data. If it’s missing, the standard says to assume big-endian, though many Windows applications and APIs assume little-endian. You can also specify the encoding as UTF-16BE / UTF-16LE and omit the BOM (for example, if you are using HTTP, you can set the encoding in the Content-Type header, typically something like Content-Type: text/html; charset=utf-8).

For example, in the designs.doc on adams.dd, we can see from a hex dump that there is some UTF-16 Unicode text:

00276bf0  00 00 00 00 00 00 00 00  3f 1a 02 00 00 00 00 00  |........?.......|
00276c00  05 00 53 00 75 00 6d 00  6d 00 61 00 72 00 79 00  |..S.u.m.m.a.r.y.|
00276c10  49 00 6e 00 66 00 6f 00  72 00 6d 00 61 00 74 00  |I.n.f.o.r.m.a.t.|
00276c20  69 00 6f 00 6e 00 00 00  00 00 00 00 00 00 00 00  |i.o.n...........|
00276c30  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00276c40  28 00 02 01 ff ff ff ff  ff ff ff ff ff ff ff ff  |(...............|
00276c50  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

But when we run strings, we won’t see this text, because each ASCII character is separated by a zero byte, and strings by default won’t print anything unless there are at least four printable characters in a row. We can use GNU strings to find these values, as it can be told which UTF encoding to use. Let’s tell it to use 16-bit little-endian:

> /usr/local/Cellar/binutils/2.37/bin/strings -e l Designs.doc

Part of our output will be SummaryInformation.

(Make sure you are calling the GNU strings you’ve installed; the one that comes with macOS doesn’t do UTF.)