04: Carving, strings, and Unicode
Welcome
Announcements
The most important thing to know today: the course web site is at http://people.cs.umass.edu/~liberato/courses/2019-spring-compsci365/. It’s linked to from Moodle, too. (This announcement will go away after add/drop ends.)
If you need to be added to Piazza due to a recent add, please email ASAP to request access.
A bit more Python
Some stuff you might find useful lives in the Python Standard Library, which is well documented here.
Here are some built-in functions:
print(), like, prints stuff. It can take more than one object and prints their string representations to a file (stdout by default). You can point it at other files with the file= argument; you can change the separator between objects with the sep= argument; and you can change the end-of-line marker (by default: a newline) with the end= argument.
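For example (the filename here is made up):

# print multiple objects with a custom separator and line ending
print('deadbeef', 42, sep=', ', end='!\n')
# -> deadbeef, 42!

# print to a file instead of stdout (hypothetical filename)
with open('log.txt', 'w') as f:
    print('a line of output', file=f)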
Use open() in a with block, which automatically closes the file when the block exits; note the mode parameter if you want to open the file in binary! If you read() from a file opened in binary mode, you get back a bytes object. You can index it like an array to get the nth byte, which will be returned as an int; or you can “slice” it just like any other sequence in Python. When displayed in the REPL, a bytes object shows the ASCII character if the byte is printable, or \xHH if it’s not, where HH is the hexadecimal value of the byte.
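A minimal sketch (assuming some file named data.bin exists):

# open in binary mode ('rb') so read() returns bytes
with open('data.bin', 'rb') as f:
    data = f.read()

data[0]      # -> an int: the value of the first byte
data[0:4]    # -> a bytes object: the first four bytes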
hex() returns the hex value (as a string) of a given int; it might be useful when converting individual bytes to hex.
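For example:

hex(163)    # -> '0xa3'
hex(255)    # -> '0xff'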
You might also consider string formatting. If you’ve used C before, you’ve probably plumbed some of printf’s depths. Python’s str.format() has similar depth. Used simply, you can say something like: 'Hello {}'.format('world'). The {} are interpolated with the following argument(s).
But it’s more than that. You can also tell it to format values in certain ways, like floating point with a certain number of digits, or left- or right-alignment within a certain number of columns, or leading zeros, and so on. One that’s particularly useful for HW2 looks like this: '{:08x}'.format(12345). If you read the docs, there’s a “replacement field” which consists of several optional things; the last one is delimited by a colon, and is a format specification. There’s a whole mini-language there, including many optional parameters. Of particular interest here are the 0, which means the value should be zero-padded; the 8, which says the field should be eight characters wide; and the x, which is a type and directs format to render the value in hex.
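A few more examples of the format mini-language:

'{:08x}'.format(12345)    # -> '00003039' (hex, zero-padded to 8 characters)
'{:>10}'.format('hi')     # -> '        hi' (right-aligned in 10 columns)
'{:.3f}'.format(3.14159)  # -> '3.142' (three digits after the decimal point)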
But enough about the homework: you should be able to do this one now.
Carving
“Carving” is a generic term for extracting and assembling the “interesting” bits from a larger collection of bits. At a high level, a substantial portion of the technical content we’re going to cover will be successive applications of carving techniques. The definition of “interesting” will grow more complex, as will the process of extraction and assembly, but the underlying idea will remain the same.
These techniques are very similar to parsing, and we can in fact use the same general family of techniques. In this course, you’ll hand-write most tools yourself, but know that if you take a compilers course (or Prof. Arjun Guha’s version of 220), some of what we’re going to do will have a lot of overlap with lexers and parsers, which can be automatically generated.
Carving text (ASCII) from files
Suppose we don’t know anything about a file or filetype. In the long term, we might take the time to reverse engineer the file type from existing data, from source code we might have access to, or, worst case, through binary reverse engineering.
But in the short term, we might try to extract meaningful data from the file. The simplest form of data we might try to pull out is text. How can we do this? A naive algorithm is to read bytes sequentially, outputting each run of bytes that represents valid ASCII text. We might set a minimum length on the runs to help ensure we’re extracting meaningful text, and not just random bytes that happen to be printable.
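Here is a minimal sketch of that naive algorithm in Python (my own illustration, not how strings itself is implemented):

def extract_strings(data: bytes, min_len: int = 4):
    """Yield runs of at least min_len printable ASCII bytes from data."""
    run = bytearray()
    for b in data:
        if 0x20 <= b <= 0x7e:          # printable ASCII range
            run.append(b)
        else:
            if len(run) >= min_len:
                yield run.decode('ascii')
            run.clear()
    if len(run) >= min_len:            # don't forget a run ending at EOF
        yield run.decode('ascii')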
(Show the strings state machine.)
The strings utility is installed on most UNIX machines, and by default extracts ASCII strings of four or more printable characters from a given input. The version of strings installed determines some of the fiddly behavior, like whether it only considers strings that are NUL- or newline-terminated.
If you run strings on a text file, then you just get the lines of that file that contain four or more characters:
# -e is to turn on escape characters ('\n') in my version of `echo`
> echo -e "Hello Marc\nabc\n\nGoodbye Marc" > test.txt
# `cat` sends its input to standard output
> cat test.txt
Hello Marc
abc
Goodbye Marc
> strings test.txt
Hello Marc
Goodbye Marc
The GNU version of strings (installed on my Mac as gstrings) allows you to search files not just for ASCII, but for general Unicode in UTF-8, -16, or -32; for the last two, it lets you specify little- or big-endian encodings. These are specified using the -e option. (More on this topic in a bit.)
> man gstrings
> gstrings -h
Usage: gstrings [option(s)] [file(s)]
Display printable strings in [file(s)] (stdin by default)
The options are:
-a - --all Scan the entire file, not just the data section [default]
-d --data Only scan the data sections in the file
-f --print-file-name Print the name of the file before each string
-n --bytes=[number] Locate & print any NUL-terminated sequence of at
-<number> least [number] characters (default 4).
-t --radix={o,d,x} Print the location of the string in base 8, 10 or 16
-w --include-all-whitespace Include all whitespace as valid string characters
-o An alias for --radix=o
-T --target=<BFDNAME> Specify the binary file format
-e --encoding={s,S,b,l,B,L} Select character size and endianness:
s = 7-bit, S = 8-bit, {b,l} = 16-bit, {B,L} = 32-bit
-s --output-separator=<string> String used to separate strings in output.
@<file> Read options from <file>
-h --help Display this information
-v -V --version Print the program's version number
gstrings: supported targets: mach-o-x86-64 mach-o-i386 mach-o-le mach-o-be mach-o-fat pef pef-xlib sym plugin srec symbolsrec verilog tekhex binary ihex
Report bugs to <http://www.sourceware.org/bugzilla/>
Data validity
How do we know strings extracted in this way are meaningful? In general, we don’t, though there might be a line of inductive reasoning that could apply.
A supporting line of evidence might be argued probabilistically. What are the odds that n sequential bytes are all printable ASCII? p^n, where p is the probability that each individual byte is printable ASCII. If you assume each byte is generated IID and uniformly at random, p = 95/256. (That’s kind of a weird assumption, unless your data source is a random number generator, though.)
That’s for a single sequence; what if you want to ask whether, by chance alone, we’d find a run of n ASCII bytes somewhere in m total bytes? There are m - n + 1 such runs. “At least one run” is the opposite of “no runs.” The probability that a particular run is not all ASCII is 1 - p^n; if you (wrongly) treat the runs as independent, the probability of no runs is (1 - p^n)^(m - n + 1), so consider 1 minus that quantity. Or should you? These runs are definitely not independent, since each successive run shares a fraction (n-1)/n of its bytes with the previous run, in order! Ultimately, we can play probability games for whatever question you want to ask. It’s important, then, to note that tools like strings are best used to help generate hypotheses or to reconstruct unknown file formats, and not generally as definitive evidence on their own.
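A quick back-of-the-envelope check of those numbers (using the independence approximation above; n and m are arbitrary illustrative values):

p = 95 / 256            # probability a uniformly random byte is printable ASCII
n, m = 4, 1024          # run length and total number of bytes
p_run = p ** n          # probability a specific run of n bytes is all ASCII
p_any = 1 - (1 - p_run) ** (m - n + 1)   # approx. chance of at least one run
print(p_run, p_any)
# -> roughly 0.019 and essentially 1.0: short "strings" appear in random data
#    by chance all the time, which is why minimum lengths only help so much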
Carving Unicode / UTF-8
Not all text is ASCII. Recall that Unicode maps characters to code points (numbers), and the various UTF schemes map code points to particular byte encodings. Code points are just hex values, and are often written as U+XXXX, where XXXX is the hex value. So, for example, the code point for £ is 163 (in decimal), 0xa3 (in hex), or U+A3 (or U+00A3) written as a Unicode code point.
Let’s consider how UTF-8 encodes Unicode code points into bytes.
7-bit characters are encoded in a single byte.
11-bit characters are encoded in two bytes.
16-bit characters are encoded in three bytes.
21-bit characters are encoded in four bytes.
How?
Let’s look at the bit patterns, starting with 7-bit characters:
0x00 – 0x7F: 0XXXXXXX (where each X is the bit from the character)
0x80 – 0x7FF: 110XXXXX 10XXXXXX
0x800 – 0xFFFF: 1110XXXX 10XXXXXX 10XXXXXX
0x10000 – 0x10FFFF: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
Take the code point value and write it in binary using the minimum number of bits it will fit in (7, 11, 16, or 21), left-padding with zero bits. Then pack the bits left-to-right into the patterns above, replacing the Xs with the bits of the code point.
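Here’s a sketch of that packing procedure in Python (my own illustration; in practice Python’s built-in str.encode('utf-8') does this for you, and this sketch skips validity checks like rejecting surrogates):

def utf8_encode(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoding of a single code point, per the table above."""
    if cp <= 0x7f:                                   # 1 byte: 0XXXXXXX
        return bytes([cp])
    elif cp <= 0x7ff:                                # 2 bytes: 110XXXXX 10XXXXXX
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    elif cp <= 0xffff:                               # 3 bytes
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    else:                                            # 4 bytes
        return bytes([0b11110000 | (cp >> 18),
                      0b10000000 | ((cp >> 12) & 0b111111),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])

utf8_encode(163)  # -> b'\xc2\xa3', matching '£'.encode('utf-8')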
Benefits:
- backward-compatible with ASCII
- single and multi-byte characters are distinct
- first byte indicates byte sequence length, which entails the “prefix property”: no valid sequence is a prefix of any other
- self-sync: single bytes, leading bytes, continuation bytes are distinct, so we can seek to the (potential) next character in either direction trivially
Example:
Consider the £ symbol; we can ask Python for its Unicode value (its code point) using ord, and we find it’s 163. That’s 0xa3 (binary: 10100011, eight bits), so it’s going to be encoded in 11 bits as a two-byte UTF-8 value. What value?
bin(163) = 0b10100011; we are going to pack it into 11 bits, so adding padding zeros on the left: 000 1010 0011
Let’s pack the bits into the two-byte pattern appropriately:
110XXXXX 10XXXXXX
11000010 10100011
Now let’s ask Python what these two values are:
0b11000010 = 194
0b10100011 = 163
OK. What’s the actual encoding of £, assuming Python does the right thing?
'£'.encode('utf-8')
# -> b'\xc2\xa3'
If you know how to read raw byte encodings, that looks good. Breaking it down:
u = '£'.encode('utf-8')
u[0], u[1]
# -> (194, 163)
Note that there are a few valid-seeming UTF-8 byte sequences that are, per the standard, not considered valid, or are not valid in all contexts. See Wikipedia (https://en.wikipedia.org/wiki/Unicode) under the discussion of low and high surrogates, noncharacters, and reserved and private-use code points.
So, how can you tell if text is valid UTF-8? Try to decode it! If you can, it probably is. Now, whether it’s semantically meaningful or not is a different story.
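In Python, that test is nearly a one-liner:

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

is_valid_utf8(b'\xc2\xa3')  # -> True (it's '£')
is_valid_utf8(b'\xc2\xc2')  # -> False (a continuation byte was expected)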
UTF-16
Just a few words about UTF-16.
Like UTF-8, it’s an encoding of Unicode code points (numbers) into bytes. Like UTF-8, it’s variable-width. Unlike UTF-8, each “unit” is 16 bits (two bytes) wide, so you have endianness to consider. But like UTF-8, it encodes some of Unicode directly: UTF-8 encodes the first 128 code points (U+0000 – U+007F) directly into one byte; UTF-16 encodes U+0000 – U+D7FF and U+E000 – U+FFFF directly into two bytes, and uses a four-byte encoding (a pair of two-byte units) for code points outside this range.
Notably, this means the ASCII subset of Unicode is encoded as sequences of alternating zero bytes and ASCII characters. For example, an ASCII encoding of “Marc” would be 4D 61 72 63; in big-endian UTF-16, it would read 00 4D 00 61 00 72 00 63 (and in little-endian UTF-16, 4D 00 61 00 72 00 63 00).
Writing a “discount” UTF-16 ASCII string extractor is pretty straightforward as a result.
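For example, a sketch of such an extractor, assuming big-endian UTF-16 and ASCII-only content (note it only finds strings aligned to even offsets):

def utf16be_ascii_strings(data: bytes, min_len: int = 4):
    """Yield runs of printable ASCII encoded as big-endian UTF-16 (00 XX pairs)."""
    run = []
    for hi, lo in zip(data[0::2], data[1::2]):   # walk two bytes at a time
        if hi == 0x00 and 0x20 <= lo <= 0x7e:    # zero byte, then printable ASCII
            run.append(chr(lo))
        else:
            if len(run) >= min_len:
                yield ''.join(run)
            run = []
    if len(run) >= min_len:
        yield ''.join(run)

list(utf16be_ascii_strings(b'\x00M\x00a\x00r\x00c'))  # -> ['Marc']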
To help with endianness, UTF-16 has a “byte-order mark.” The BOM is U+FEFF, the “zero width no-break space” character, which is inserted at the top/front of UTF-16 data. If it’s missing, the standard says to assume big-endian, though many Windows applications and APIs assume little-endian. You can also specify the encoding as UTF-16BE / UTF-16LE and omit the BOM (for example, if you are using HTTP, you can set the encoding in the Content-Type header, typically something like Content-Type: text/html; charset=utf-8).
GNU strings can be told which encoding to use.
Using external knowledge for encoding
Usually you know something about the data you’re searching through. If so, use that knowledge. Or just try them all, I suppose.
The following is the programmer folk wisdom for determining an unknown input’s format (a sketch of this heuristic follows the list):
- If it has a BOM, use the BOM to identify it as UTF-8, UTF-16, or UTF-32. This is the only path in this set of rules that will ever identify a file as UTF-16 or UTF-32, interestingly enough.
- If it contains only ASCII characters, identify it as ASCII.
- If it contains ASCII plus 0xA0 (nbsp), supposedly a semi-common case, identify it as ISO Latin 1 (aka ISO-8859-1), similar to but distinct from Windows-1252.
- If none of the above match, run a UTF-8 prober. This is usually overkill, because UTF-8 has a very distinctive and strict structure, and you can usually identify it simply by asking: “Does this contain at least one multi-byte UTF-8 sequence (if not, exit early), and if so, does it parse as valid UTF-8?” If true, it’s generally reasonable to assume UTF-8.
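A minimal sketch of that heuristic (the BOM constants come from Python’s codecs module; the return labels are my own simplification of the rules above):

import codecs

def guess_encoding(data: bytes) -> str:
    # Rule 1: a BOM, if present, is decisive. Check UTF-32 before UTF-16,
    # since the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    if data.startswith(codecs.BOM_UTF32_LE) or data.startswith(codecs.BOM_UTF32_BE):
        return 'utf-32'
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'
    # Rule 2: pure ASCII
    if all(b < 0x80 for b in data):
        return 'ascii'
    # Rule 3: ASCII plus non-breaking spaces (0xA0)
    if all(b < 0x80 or b == 0xa0 for b in data):
        return 'iso-8859-1'
    # Rule 4: a (simplistic) UTF-8 prober
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'unknown'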
In general data-recovery scenarios you can use uchardet, which uses a slightly more sophisticated algorithm and can correctly identify many of the pre-Unicode text encodings.
Note that you can’t (or at least, shouldn’t) use uchardet or the like on general binary data; only on data you know (or suspect) to consist of text.
Carving files out of files: JPEGs in DOCs
(A sample of things to come!)
JPEG is an encoding; JFIF is the file format.
Interestingly, the start and end of the image stored in JFIF are marked with particular sequences of bytes (0xFFD8 and 0xFFD9, respectively). And, if we read about JPEG, we see that any 0xFF byte in the compressed data has a 0x00 byte appended, to prevent “framing errors”: without the stuffed zero byte, compressed data could be mistaken for a marker. More generally, 0xFF bytes in a “.jpg” (that is, a JFIF containing a JPEG) are used to denote something special, as we’ll see later.
But for now, what is the implication of the start and end sequences (0xFFD8 / 0xFFD9)? Well, taken in combination with the fact that you can “embed” JPEGs in most file formats, it means we can carve (and recover) the original JPEG from a file where it’s been embedded. We look for each 0xFFD8 followed by a 0xFFD9. We then write all the data between each pair of markers (including the markers) into its own file.
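A naive carving sketch (it ignores the 0xFF 0x00 stuffing subtlety and will happily carve false positives, but it illustrates the idea; the input filename is made up):

SOI = b'\xff\xd8'   # start-of-image marker
EOI = b'\xff\xd9'   # end-of-image marker

def carve_jpegs(data: bytes):
    """Yield every chunk delimited by SOI ... EOI, markers included."""
    start = data.find(SOI)
    while start != -1:
        end = data.find(EOI, start + 2)
        if end == -1:
            break
        yield data[start:end + 2]          # include both markers
        start = data.find(SOI, end + 2)

with open('document.doc', 'rb') as f:      # hypothetical input file
    blob = f.read()
for i, jpg in enumerate(carve_jpegs(blob)):
    with open('carved_{}.jpg'.format(i), 'wb') as out:
        out.write(jpg)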
(Demo)
What else might the 0xFF bytes encode? Remember EXIF? We’ll probably get to some of these details next class.