04: Carving, Strings, and Unicode
Carving
“Carving” is a generic term for extracting and assembling the “interesting” bits from a larger collection of bits. The definition of “interesting” will grow more complex, as will the process of extraction and assembly, but the underlying idea will remain the same. The best type of carving is parsing, where we interpret the carved bits more carefully, as part of some well-defined specification.
What distinguishes carving from parsing? Informally, carving is where you find candidates for something that could be parsed. For example, all JPG files start with 0xFFD8 and end with 0xFFD9. If we want to recover deleted JPGs, we run through a filesystem looking for 0xFFD8 and then find a subsequent 0xFFD9; we then say that everything in between is a candidate for being part of the same JPG file. There is more to JPG than the start and end markers, so we then parse the candidate file and see whether the other aspects of the JPG standard parse correctly or in error. If parsing fails, then we don’t have a JPG file (or at least not a complete one; we might have half of one).
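To make the idea concrete, here is a minimal sketch of this style of carving in Python. (The filename raw.dd is a made-up example; a real carver would also sanity-check sizes and then try to actually parse each candidate.)

```python
# Naive JPG carver: find 0xFFD8 ... 0xFFD9 candidate spans in a raw image.
# This only finds *candidates*; parsing decides whether they are real JPGs.

def carve_jpg_candidates(data: bytes):
    candidates = []
    start = data.find(b'\xff\xd8')
    while start != -1:
        end = data.find(b'\xff\xd9', start + 2)
        if end == -1:
            break  # a start marker with no end marker: maybe half a JPG
        candidates.append(data[start:end + 2])  # include the end marker
        start = data.find(b'\xff\xd8', end + 2)
    return candidates

with open('raw.dd', 'rb') as f:  # 'raw.dd' is an assumed example filename
    for i, candidate in enumerate(carve_jpg_candidates(f.read())):
        with open(f'candidate-{i}.jpg', 'wb') as out:
            out.write(candidate)
```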
Carving text (ASCII) from files
Suppose we don’t know anything about a file or filetype. In the long term, we might take the time to reverse engineer the file type from existing data, from source code we might have access to, or, in the worst case, by reverse engineering the executable itself. (You can enroll in CMPSCI 390R if you are interested in learning how to examine assembly language in compiled code.)
But let’s stick to the short term and try to extract meaningful data from the file. The simplest form of data we might try to pull out is text. How can we do this? A naive algorithm is to read bytes sequentially, outputting each run of bytes that represents valid ASCII text. We might set a minimum length on the runs to help ensure we’re getting meaningful text, and not just random bytes that happen to be printable.
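Here is a sketch of that naive algorithm in Python (the minimum run length of four matches the default of the strings utility described next; treating only 0x20 through 0x7E as printable is our simplifying assumption):

```python
# Naive ASCII "strings": print each run of four or more printable ASCII bytes.
import sys

MIN_RUN = 4  # minimum run length, same default as the strings utility

def extract_ascii_runs(data: bytes, min_run: int = MIN_RUN):
    run = bytearray()
    for byte in data:
        if 0x20 <= byte <= 0x7e:  # printable ASCII: space through '~'
            run.append(byte)
        else:
            if len(run) >= min_run:
                yield run.decode('ascii')
            run.clear()
    if len(run) >= min_run:  # flush a run that extends to end-of-file
        yield run.decode('ascii')

with open(sys.argv[1], 'rb') as f:
    for s in extract_ascii_runs(f.read()):
        print(s)
```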
The strings utility is installed on most UNIX machines, and by default it extracts ASCII strings of four or more printable characters from a given input. The version of strings installed determines some of the fiddly behavior, like whether it only considers strings that are NUL- or newline-terminated.
If you run strings on a text file, then you just get the lines of that file that contain four or more characters:
# -e is to turn on escape characters ('\n') in my version of `echo`
> echo -e "Hello Marc\nabc\n\nGoodbye Marc" > test.txt
# `cat` sends its input to standard output
> cat test.txt
Hello Marc
abc
Goodbye Marc
> strings test.txt
Hello Marc
Goodbye Marc
The GNU version of strings (installed on the edlab as /usr/bin/strings) allows you to search files not just for ASCII, but for general Unicode in UTF-8, UTF-16, or UTF-32; for the last two, it lets you specify little- or big-endian encodings. These are selected using the -e option. (More on this topic in a bit.)
> strings --help
Usage: strings [option(s)] [file(s)]
Display printable strings in [file(s)] (stdin by default)
The options are:
-a - --all Scan the entire file, not just the data section [default]
-d --data Only scan the data sections in the file
-f --print-file-name Print the name of the file before each string
-n --bytes=[number] Locate & print any NUL-terminated sequence of at
-<number> least [number] characters (default 4).
-t --radix={o,d,x} Print the location of the string in base 8, 10 or 16
-w --include-all-whitespace Include all whitespace as valid string characters
-o An alias for --radix=o
-T --target=<BFDNAME> Specify the binary file format
-e --encoding={s,S,b,l,B,L} Select character size and endianness:
s = 7-bit, S = 8-bit, {b,l} = 16-bit, {B,L} = 32-bit
-s --output-separator=<string> String used to separate strings in output.
@<file> Read options from <file>
-h --help Display this information
-v -V --version Print the program's version number
By the way, if you want to use GNU strings on your own machine, you can install it as follows. (Command-line arguments differ among versions.)
- Windows: download page
- WSL/Linux: install with
sudo apt install binutils
- MacOS: install with
brew install binutils
Carving Unicode / UTF-8
Not all text is ASCII. Unicode is an encoding of characters beyond ASCII, for languages that need more than a–z. Unicode maps characters to numbers, which the standard refers to as code points. The various “Unicode Transformation Format” (UTF) schemes map code point values to particular byte encodings. Code points are just numbers, often written as U+XXXX, where XXXX is the value in hex.
For example, the code point for £ is:
- 163 (in decimal);
- 0xa3 (in hex); and
- U+00A3 (or U+A3) when written as a Unicode code point.
Let’s consider how UTF-8 encodes Unicode code points into bytes.
Characters mapped to code points expressible in:
- 7-bits (or fewer) are encoded in a single byte.
- 8 bits to 11 bits are encoded in two bytes.
- 12 bits to 16 bits are encoded in three bytes.
- 17 bits to 21 bits are encoded in four bytes.
How?
- 00–07 bits: from 0x00 to 0x7F: 0XXXXXXX (where each X is the bit from the character)
- 08–11 bits: from 0x80 to 0x7FF: 110XXXXX 10XXXXXX
- 12–16 bits: from 0x800 to 0xFFFF: 1110XXXX 10XXXXXX 10XXXXXX
- 17–21 bits: from 0x10000 to 0x10FFFF: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
Method: Take the code point value and write it in binary using the minimum number of bits it will fit in (7, 11, 16, or 21), left-padding with zero bits. Then pack the bits left-to-right into the patterns above, replacing the Xs with the bits of the code point. (Side note: the bits are packed most-significant first, so UTF-8 looks the same on any machine you are on; there is no endianness to worry about.)
Example:
Consider the £ symbol; we can ask Python for its Unicode value (its code point) using ord(), and we find it’s 163. That’s 0xa3 (or 0b10100011, which is eight bits). So it’s going to be encoded in 11 bits as a two-byte UTF-8 value. What value?
>>> ord('£')
163
>>> bin(163)
'0b10100011'
Padding zeros on the left to get to 11 bits: 000 1010 0011
Let’s pack the 11 bits into 2 bytes (16 bits): 110XXXXX 10XXXXXX becomes 11000010 10100011.
Now let’s ask Python what these two values are in decimal:
>>> 0b11000010
194
>>> 0b10100011
163
Let’s check our work by asking Python for the actual UTF-8 encoding of £:
>>> '£'.encode('utf-8')
b'\xc2\xa3'
If you know how to read raw byte encodings, that looks good; if not, you can see the decimal values like so:
>>> u = '£'.encode('utf-8')
>>> u[0], u[1]
(194, 163)
(Why does u[0] give an integer? Because u is a bytes type… We’ll get to that in a bit.)
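Generalizing the hand computation above, here is a sketch of the whole Method as a hand-rolled encoder (utf8_encode is our own illustrative name; real code would just call str.encode):

```python
def utf8_encode(codepoint: int) -> bytes:
    # Pack a Unicode code point into UTF-8 bytes by hand.
    if codepoint <= 0x7f:        # 7 bits or fewer: 0XXXXXXX
        return bytes([codepoint])
    elif codepoint <= 0x7ff:     # up to 11 bits: 110XXXXX 10XXXXXX
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0b111111)])
    elif codepoint <= 0xffff:    # up to 16 bits: 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0b111111),
                      0b10000000 | (codepoint & 0b111111)])
    else:                        # up to 21 bits: 11110XXX plus three continuation bytes
        return bytes([0b11110000 | (codepoint >> 18),
                      0b10000000 | ((codepoint >> 12) & 0b111111),
                      0b10000000 | ((codepoint >> 6) & 0b111111),
                      0b10000000 | (codepoint & 0b111111)])

assert utf8_encode(ord('£')) == '£'.encode('utf-8')  # b'\xc2\xa3', as above
assert utf8_encode(ord('€')) == '€'.encode('utf-8')  # a three-byte example
```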
Note that there are a few valid-seeming UTF-8 byte sequences that are, per the standard, not valid (or not valid in all contexts). See Wikipedia (https://en.wikipedia.org/wiki/Unicode), in particular the discussion of low and high surrogates, non-characters, reserved and private-use codes, etc.
So, how can you tell if text is valid UTF-8? Try to decode it! If you can, it probably is. Now, whether it’s semantically meaningful or not is a different story.
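In Python, that check is a one-liner wrapped in a try/except (a sketch; bytes.decode raises UnicodeDecodeError on invalid input):

```python
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b'\xc2\xa3'))  # True: the UTF-8 encoding of '£'
print(is_valid_utf8(b'\xc2\xc2'))  # False: 0xc2 must be followed by a continuation byte
```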
What are the benefits of this approach?
- Backward compatibility with ASCII: all ASCII characters are encoded identically in ASCII and UTF-8.
- Single-byte and multi-byte characters are distinct.
- The first byte indicates the length of the byte sequence, which entails the “prefix property”: no valid sequence is a prefix of any other.
- Self-synchronization: single bytes, leading bytes, and continuation bytes are distinct, so we can trivially seek to the (potential) next character in either direction (see the sketch below).
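Here is a sketch of that last property: from an arbitrary byte offset, we can find the start of the enclosing character just by skipping backward over continuation bytes (those matching 10XXXXXX). The function name is our own:

```python
def sync_to_char_start(data: bytes, pos: int) -> int:
    # Back up from pos to the nearest byte that could start a character.
    while pos > 0 and (data[pos] & 0b11000000) == 0b10000000:
        pos -= 1  # 10XXXXXX is a continuation byte; keep backing up
    return pos

text = 'a£b'.encode('utf-8')        # b'a\xc2\xa3b'
print(sync_to_char_start(text, 2))  # 1: byte 2 (0xa3) continues byte 1 (0xc2)
print(sync_to_char_start(text, 3))  # 3: byte 3 ('b') starts a character itself
```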
UTF-16
Just a few words about UTF-16.
Like UTF-8, it’s an encoding of Unicode code points (numbers) into bytes. Like UTF-8, it’s variable-width. Unlike UTF-8, each “unit” is 16 bits (two bytes) wide, and the standard does not mandate a specific endianness. Like UTF-8, it encodes some of Unicode directly: where UTF-8 encodes the first 128 code points (U+0000 to U+007F) directly into one byte, UTF-16 encodes U+0000 to U+D7FF and U+E000 to U+FFFF directly into two bytes, and uses a four-byte encoding (two two-byte units) for code points outside this range.
Notably, this means the ASCII subset of Unicode is encoded as sequences of alternating zero bytes and ASCII characters. For example, an ASCII encoding of “Marc” would be 4D 61 72 63; in big-endian UTF-16, it would read 00 4D 00 61 00 72 00 63 (and in little-endian UTF-16, 4D 00 61 00 72 00 63 00).
Writing a hacky UTF-16 ASCII string extractor is pretty straightforward as a result.
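For example, here is a sketch of such an extractor for little-endian UTF-16 (the minimum run of four and the printable range are our choices, mirroring the ASCII version above):

```python
# Hacky extractor for ASCII text stored as little-endian UTF-16:
# look for runs of (printable ASCII byte, zero byte) pairs.
import sys

MIN_RUN = 4

def extract_utf16le_ascii(data: bytes, min_run: int = MIN_RUN):
    run = bytearray()
    i = 0
    while i + 1 < len(data):
        lo, hi = data[i], data[i + 1]
        if hi == 0 and 0x20 <= lo <= 0x7e:  # ASCII char followed by a zero byte
            run.append(lo)
            i += 2
        else:
            if len(run) >= min_run:
                yield run.decode('ascii')
            run.clear()
            i += 1  # re-sync one byte at a time in case we were misaligned
    if len(run) >= min_run:
        yield run.decode('ascii')

with open(sys.argv[1], 'rb') as f:
    for s in extract_utf16le_ascii(f.read()):
        print(s)
```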
To help with endianness, UTF-16 has a “byte-order mark” (BOM). The BOM is U+FEFF, the “zero-width no-break space” character, which is inserted at the top/front of UTF-16 data. If it is missing, the standard says to assume big-endian, though many Windows applications and APIs assume little-endian. You can also specify the encoding as UTF-16BE / UTF-16LE and omit the BOM (for example, if you are using HTTP, you can set the encoding in the Content-Type header, typically something like Content-Type: text/html; charset=utf-8).
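You can see the BOM from Python. (The generic 'utf-16' codec prepends a BOM and, on typical machines, picks little-endian; the explicit BE/LE variants omit the BOM.)

```python
>>> 'Marc'.encode('utf-16')     # BOM first: ff fe means little-endian
b'\xff\xfeM\x00a\x00r\x00c\x00'
>>> 'Marc'.encode('utf-16-be')  # explicit big-endian, no BOM
b'\x00M\x00a\x00r\x00c'
>>> 'Marc'.encode('utf-16-le')  # explicit little-endian, no BOM
b'M\x00a\x00r\x00c\x00'
```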
For example, in the Designs.doc file on the adams.dd image, we can see from a hex dump that there is some UTF-16 Unicode:
00276bf0 00 00 00 00 00 00 00 00 3f 1a 02 00 00 00 00 00 |........?.......|
00276c00 05 00 53 00 75 00 6d 00 6d 00 61 00 72 00 79 00 |..S.u.m.m.a.r.y.|
00276c10 49 00 6e 00 66 00 6f 00 72 00 6d 00 61 00 74 00 |I.n.f.o.r.m.a.t.|
00276c20 69 00 6f 00 6e 00 00 00 00 00 00 00 00 00 00 00 |i.o.n...........|
00276c30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00276c40 28 00 02 01 ff ff ff ff ff ff ff ff ff ff ff ff |(...............|
00276c50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
But when we run strings, we won’t see this Unicode, because each ASCII character is separated by a zero byte, and strings by default won’t print anything unless there are at least 4 printable characters in a row. We can use GNU strings to find these values, as it can be told which UTF encoding to use. Let’s tell it to use 16-bit little-endian:
/usr/local/Cellar/binutils/2.37/bin/strings -e l Designs.doc
And part of our output will be
SummaryInformation
(Make sure you are calling the GNU strings you’ve installed. The one that comes with MacOS doesn’t do UTF.)