05: int to binary
Carving files out of files: JPEGs in DOCs
To be pedantic for a minute: when we store an image as a JPG, we say that it’s a JPEG file. But in reality, JPEG is an encoding of an image, and the JPEG-encoded image is stored in the JFIF file format.
The start and end of the image stored in JFIF are marked with a particular sequence of bytes (0xFFD8 / 0xFFD9).
One can “embed” JPEGs in most file formats, which means we can carve (and recover) the original JPEG from a file where it’s been embedded. We look for each 0xFFD8 that is followed by a 0xFFD9, then write all the data between each pair of markers (including the markers) into its own file.
[Demo of carving JPEGs embedded in a pdf document.]
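The marker search described above can be sketched in a few lines of Python. This is a minimal illustration, not the demo's actual code: the function name carve_jpegs and the made-up test data are assumptions, and it ignores complications such as a thumbnail JPEG nested inside another JPEG.

```python
SOI = b"\xff\xd8"  # start-of-image marker
EOI = b"\xff\xd9"  # end-of-image marker


def carve_jpegs(blob):
    """Return one bytes object per SOI...EOI pair found in blob (hypothetical helper)."""
    carved = []
    start = blob.find(SOI)
    while start != -1:
        end = blob.find(EOI, start + len(SOI))
        if end == -1:
            break  # a start marker with no matching end marker
        carved.append(blob[start:end + len(EOI)])  # keep both markers
        start = blob.find(SOI, end + len(EOI))
    return carved


# Made-up stand-in for a document with two embedded images.
fake_doc = b"junk" + SOI + b"image one" + EOI + b"more junk" + SOI + b"image two" + EOI + b"tail"
pieces = carve_jpegs(fake_doc)
print(len(pieces))
```

Each element of pieces could then be written to its own file opened in binary mode.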
File formats
File formats are a way to specify how information is arranged on disk. (More generally, data structure formats can apply to memory, too; and recovering evidence from memory works the same way but can be more complicated. It’s a topic we won’t cover.)
Formats can be textual (which just means printable ASCII) or binary. To read text formats, we can just open them up (or write a parser). To read binary, we can view them in a hex editor, or we can write a parser.
Textual file formats
HTML and JSON are file formats most of you are familiar with.
HTML separates content from its markup “tags”. All html tags are enclosed in pairs of angle brackets ‘<>’. The meaning of each tag is its semantics. Here’s some simple HTML for bolding a word:
The last word of this sentence is in <b>bold</b>.
That is displayed in your browser like so:
The last word of this sentence is in bold.
To extract the tags, you can imagine that we could write a program that would linearly scan through an HTML file, starting in “text” mode. Capture as you go, watching for open brackets, and switching into “tag” mode when one is encountered. Then switch back to “text” mode when you see the close bracket.
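Here is a minimal sketch of that two-mode scan. The function name split_tags is hypothetical, and the code deliberately ignores comments, attributes containing angle brackets, and malformed input.

```python
def split_tags(html):
    """Scan html one character at a time, switching between text and tag mode."""
    text_parts, tags = [], []
    mode = "text"
    current = ""
    for ch in html:
        if mode == "text" and ch == "<":
            if current:
                text_parts.append(current)
            current = ""
            mode = "tag"
        elif mode == "tag" and ch == ">":
            tags.append(current)
            current = ""
            mode = "text"
        else:
            current += ch  # capture as we go
    if mode == "text" and current:
        text_parts.append(current)
    return text_parts, tags


text_parts, tags = split_tags("The last word of this sentence is in <b>bold</b>.")
print(tags)
print(text_parts)
```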
Problems? Sure. You need to read the spec to understand about comments. And what about incorrectly formatted (technically invalid) files? That happens all the time with textual formats (since humans are directly editing them). Should you fail or attempt to recover? But you get the idea.
JSON is similar, but more complicated. For example, here’s some code from the autograder specifying part of the assignment.
{
"type": "single",
"file": "strings.py",
"wpo_tests": [
{
"score": 5,
"name": "simple UTF-8 length 2",
"command": "python3.11 strings.py -n 2 simple-utf8.txt",
"expected": "simple-utf8.txt.2.expected"
},
...
I haven’t told you all there is to know about JSON… but still, I’m sure you can see the basic structure: curly braces are start-of-dictionary and colons separate keys from values; quotes are start-of-string; integers are literals, square brackets are start-of-list, commas separate items, and so on.
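To see those rules in action, Python's standard json module can parse a small document into dictionaries, lists, strings, and integers. The document string below is made up for illustration; it only borrows a few key names from the excerpt and is not the autograder's real file.

```python
import json

# A made-up, complete JSON document (not the real autograder configuration).
document = '{"type": "single", "wpo_tests": [{"score": 5, "name": "example"}]}'

parsed = json.loads(document)
tests = parsed["wpo_tests"]
print(type(parsed))         # curly braces became a dict
print(type(tests))          # square brackets became a list
print(tests[0]["score"] + 1)  # the integer literal is a real int
```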
This isn’t a compilers class so we won’t go into detail here, but there’s a simple grammar that can be represented in various ways (see http://json.org/ for two: a visual representation and a chart of tokens). Again, there are standard techniques to build general parsers based upon an input grammar, but we’re going to hand-write our parsers / carvers in this class.
Dividing data
In both JSON and HTML (and indeed, in most text-based, easily-human-readable formats) there’s widespread use of delimiters, marker characters or strings that are used to show where one element ends and another begins. A very common older use of this is so-called “NUL terminated strings”, where ASCII (or perhaps UTF-8) data is stored starting in a known location, and continues until it stops. How do you know it stops? There’s a NUL (0x00) byte after it.
NUL-terminated strings come from the land of the C programming language, and they are the source of many potential problems in programs (for example, what if you forget to write the NUL byte? Or what if it gets overwritten in a fixed-size array in C?).
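A minimal sketch of reading a NUL-terminated string out of a larger buffer follows. The helper name read_cstring and the sample buffer are assumptions for illustration; the helper raises an error in the "forgot the NUL" case rather than running off the end.

```python
def read_cstring(buf, offset=0):
    """Return the bytes from offset up to (not including) the next NUL byte."""
    end = buf.find(b"\x00", offset)
    if end == -1:
        raise ValueError("no NUL terminator found")
    return buf[offset:end]


buffer = b"Apple\x00iPhone 5\x00"
make = read_cstring(buffer)
model = read_cstring(buffer, len(make) + 1)  # skip past the first NUL
print(make, model)
```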
Anyway, markers are generally convenient and readable, but there are other approaches that have other benefits (and drawbacks).
One approach is to use a fixed amount of space for an element. Sometimes this is called using a “fixed-width” field, and in binary formats, width is usually measured in bytes. Essentially, the format hard-codes something into place. Like, “the next four bytes will be an unsigned int representing the street number” or the like. When one or more fixed-width fields are present, the programmer knows exactly how far to seek ahead to access any particular one.
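As a sketch of why fixed-width fields make seeking easy, suppose (hypothetically) that each record is a four-byte big-endian street number followed by a two-byte unit number. The layout, the names, and the sample values are all made up for illustration.

```python
RECORD_SIZE = 6  # hypothetical layout: 4-byte street number + 2-byte unit number

# Two made-up records: (123, 4) and (456, 17)
records = bytes.fromhex("0000007b 0004 000001c8 0011")


def street_number(data, index):
    """Jump straight to record number index; no scanning for delimiters needed."""
    start = index * RECORD_SIZE
    return int.from_bytes(data[start:start + 4], "big")


def unit_number(data, index):
    start = index * RECORD_SIZE + 4  # skip over the street number field
    return int.from_bytes(data[start:start + 2], "big")


print(street_number(records, 1), unit_number(records, 1))
```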
Another approach is to explicitly embed something about the length of a field into another field. Usually called “length” or “size” fields, these fields represent either the length of another field, or the number of records (of one or more field) that follow (where the latter are usually fixed-size).
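A length field can be sketched like this, assuming a made-up format in which one byte gives the length of the string that follows it. The helper name read_length_prefixed is hypothetical.

```python
def read_length_prefixed(data, offset=0):
    """Read a one-byte length, then that many bytes; return (field, next offset)."""
    length = data[offset]  # indexing a bytes object yields an int
    start = offset + 1
    return data[start:start + length], start + length


blob = b"\x05Apple\x08iPhone 5"
first, next_offset = read_length_prefixed(blob)
second, next_offset = read_length_prefixed(blob, next_offset)
print(first, second, next_offset)
```

Unlike a NUL terminator, this lets the string itself contain any byte value, at the cost of capping its length at whatever the length field can express (255 here).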
Extracting binary data
As I’ve pointed out before, JPEGs contain metadata within them that give us properties such as which camera captured the image. JPEG was introduced as a standard more than 30 years ago. Even ten years ago, consumer devices were capturing images with very low resolution compared to today, typically with handheld cameras based on resource-poor microcontrollers. Today, iPhone and Android devices capture images and video that are enormous, and do so with CPUs and programming languages that are way up at the application level. From today’s point of view, the Exif data stored inside a JPEG is tiny compared to the image/video itself. Like the content of a JPEG, the Exif data is packed together in a terse binary format. Here’s the start of a JPEG (we can see ff d8 at the top), and you can see in ASCII the start of the “Exif” portion of the file.
00000000 ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 48 |......JFIF.....H|
00000010 00 48 00 00 ff e1 04 dc 45 78 69 66 00 00 4d 4d |.H......Exif..MM|
00000020 00 2a 00 00 00 08 00 09 01 0f 00 02 00 00 00 06 |.*..............|
00000030 00 00 00 7a 01 10 00 02 00 00 00 09 00 00 00 80 |...z............|
00000040 01 1a 00 05 00 00 00 01 00 00 00 8a 01 1b 00 05 |................|
00000050 00 00 00 01 00 00 00 92 01 28 00 03 00 00 00 01 |.........(......|
00000060 00 02 00 00 01 31 00 02 00 00 00 06 00 00 00 9a |.....1..........|
00000070 01 32 00 02 00 00 00 14 00 00 00 a0 87 69 00 04 |.2...........i..|
00000080 00 00 00 01 00 00 00 b4 88 25 00 04 00 00 00 01 |.........%......|
00000090 00 00 03 d2 00 00 00 00 41 70 70 6c 65 00 69 50 |........Apple.iP|
000000a0 68 6f 6e 65 20 35 00 00 00 00 00 48 00 00 00 01 |hone 5.....H....|
...
While some of it is readable, such as the use of an Apple iPhone, our human brains are not soaking up the info easily. I’d be willing to bet a shiny new phone that if the JPEG standard were put together today, Exif would be a JSON file.
But even if handheld cameras have been replaced by smart phones, cheap microcontroller-based devices are proliferating all over our houses and cities (also called “IOT” or “internet of things” devices). And those cheap devices tend to write data in this same terse binary as exif. So it’s important that we learn how to extract this data from a file. Plus, we can do it with python.
What does data look like when it’s stored as binary?
We’ll answer this question with an example. Let’s say I have a bunch of LED lights in a building, and they each have a series of characteristics: two booleans (powered on/off; warm/daylight color) and two integers (room number; watts). As a Python list, that may look like this:
lights = [[True, True, 1, 18],[False, True, 2, 13],[True,True,3,13],[True,False,4,18]]
You can see the JSON as it would be stored in a file using json.dumps(lights), after first importing the json library. (Pro-tip: The “s” in “dumps()” stands for “string”. So you could read that function name as “dump s”.)
[[true, true, 1, 18],
[false, true, 2, 13],
[true, true, 3, 13],
[true, false, 4, 18]
]
Literally the word “true” is written out, as well as the spaces and commas.
I can store the same data in a binary format. It packs down quite nicely and makes it hard for humans to comprehend, but we are here to serve the machines, and I for one welcome our new machine overlords.
Let’s do this in python. You can’t write less than a byte, which is a waste for those little 1-bit booleans, but later I’ll show you a trick for that. For now, we’ll just waste the bits.
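As a rough sketch of the wasteful one-byte-per-field version, assume (this layout is my own illustration, not a defined format) that each light becomes four bytes in the order on/off, warm/daylight, room number, watts.

```python
import json

lights = [[True, True, 1, 18], [False, True, 2, 13], [True, True, 3, 13], [True, False, 4, 18]]

packed = b""
for on, warm, room, watts in lights:
    # One byte per field; every value here fits in the range 0..255.
    packed += bytes([int(on), int(warm), room, watts])

print(packed.hex(" "))
print(len(packed), "bytes of binary versus", len(json.dumps(lights)), "characters of JSON")
```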
bytes versus strings in python
If we prefix a string literal with a b, then the value between the two quotes is a Python bytes object. It’s an array of bytes, and it is akin to a str object, which is an array of Unicode characters.
In other words, if you didn’t have the b, then what’s between the quotes is Unicode text (a sequence of code points, not bytes in any particular encoding). Let’s ask Python if that’s true:
>>> type('Hello!')
<class 'str'>
>>> type(b'Hello!')
<class 'bytes'>
You can concatenate bytes objects. But please realize that str and bytes are different types, and you can’t concatenate them together any more than you could concatenate a str and an int.
>>> 'Hello!'+ "Goodbye!"
'Hello!Goodbye!'
>>> b'Hello!'+ b"Goodbye!"
b'Hello!Goodbye!'
>>> "Hello" + 6
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
>>> 'Hello!'+ b"Goodbye!"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str
You can be explicit about any class in python.
>>> sentence = bytes()
>>> sentence += b'Hello!'
>>> sentence += b" Goodbye!"
>>> sentence
b'Hello! Goodbye!'
There is an important behavior to point out about bytes. If you index a single item, the resulting value is numeric (an integer), while a slice is still a bytes object. In contrast, if you index an item from a str, it’s still a string.
>>> word = 'hello'
>>> word[0]
'h'
>>> word[1:]
'ello'
>>> type(word[0])
<class 'str'>
>>> type(word[1:])
<class 'str'>
>>> word = b'hello'
>>> word[0]
104
>>> word[1:]
b'ello'
>>> type(word[0])
<class 'int'>
>>> type(word[1:])
<class 'bytes'>
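The same behavior shows up when you loop over a bytes object, and there are two simple ways around it when you really want bytes back. A small sketch (the variable names are just for illustration):

```python
word = b'hello'

values = list(word)         # looping or listing yields integers
first_as_bytes = word[0:1]  # a one-item slice stays a bytes object
rebuilt = bytes(values)     # a list of ints (0..255) converts back to bytes

print(values)
print(first_as_bytes)
print(rebuilt)
```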
Why use bytes instead of str? Because I want to store data as values and not as unicode. Do you see that JPEG hexdump above? As a str, it’s a bunch of unicode. That hexdump doesn’t represent no stinking unicode. Don’t do the following:
>>> dont_do_this= str('ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 48')
>>> dont_do_this
'ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 48'
>>> dont_do_this[0]
'f'
>>> bin(dont_do_this[0])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer
But if you convert it to a bytes object, now you are talking my language. You can convert even with extra spacing between the bytes:
>>> data = bytes.fromhex('ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 48')
>>> data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00H'
>>> data[0]
255
>>> hex(data[0])
'0xff'
>>> f"{data[0]:#b}" # same as bin(data[0])
'0b11111111'
We know that 0x tells Python that the digits that follow are in base 16 (hex). Now what on earth does \x mean? It means that the next two characters are each a hex nibble (i.e., four bits each). You can use 0x with one digit or more, but \x is for exactly two hex nibbles, no more, no less. 0x starts an integer literal and has no special meaning inside the quotes of a str or bytes object; whereas \x can be used only inside the quotes of a str or bytes object.
>>> 'a'
'a'
>>> '\x00'
'\x00'
>>> 'a
File "<stdin>", line 1
'a
^
SyntaxError: EOL while scanning string literal
>>> \x00
File "<stdin>", line 1
\x00
^
SyntaxError: unexpected character after line continuation character
You should look at the output of Python for each line below and explain to yourself, given the types, why Python did what it did. I would spend a bunch of time examining this output. Notice that Python displays a bytes value as printable ASCII wherever it can. Hence the first example shows a # character, as 0x23 (35 in base 10) is the code for # in the ASCII table.
>>> b'\x23'
b'#'
>>> '\x23'
'#'
>>> 0x23
35
>>> '0x23'
'0x23'
>>> type(b'\x23')
<class 'bytes'>
>>> type('\x23')
<class 'str'>
>>> type(0x23)
<class 'int'>
>>> type('0x23')
<class 'str'>
>>> b'\x23'+1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat int to bytes
>>> '\x23'+1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
>>> 0x23+1
36
>>> '0x23'+1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
>>> b'\x23'+\x01
File "<stdin>", line 1
b'\x23'+\x01
^
SyntaxError: unexpected character after line continuation character
>>> b'\x23'+0x01
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat int to bytes
>>> '\x23'+0x01
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
>>> 0x23+0x01
36
>>> '0x23'+0x01
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
Writing and Reading binary data
In the assigned reading, we introduced the notion of big and little endian and the lengths of different storage types: char, short, int (long), and long long.
About 90% of the time, we are going to read in binary data that is actually an integer value. The easiest way to do that is to use the bytes or int classes to create binary data, and we will use the int.from_bytes() function call to convert it back to integers.
Here are a ton of examples.
# value.to_bytes(length, byteorder, *, signed=False); below it is called through the class as int.to_bytes(value, ...)
# convert int to a unsigned char (1 byte) big endian
>>> int.to_bytes(0x1,length=1,byteorder="big",signed=False)
b'\x01'
>>> int.to_bytes(1,length=1,byteorder="big",signed=False)
b'\x01'
# convert int to a signed char (1 byte) big endian
>>> int.to_bytes(0x2,length=1,byteorder="big",signed=True)
b'\x02'
# convert int to a unsigned short (2 bytes) big endian
>>> int.to_bytes(0x0304,length=2,byteorder="big",signed=False)
b'\x03\x04'
# convert int to a unsigned short (2 bytes) little endian
>>> int.to_bytes(0x0305,length=2,byteorder="little",signed=False)
b'\x05\x03'
# convert int to a signed short (2 bytes) big endian
>>> int.to_bytes(-10,length=2,byteorder="big",signed=True)
b'\xff\xf6'
# convert int to a unsigned int (4 bytes)
>>> int.to_bytes(0x06070809,length=4,byteorder="big",signed=False)
b'\x06\x07\x08\t'
# convert int to a signed int (4 bytes).
# In output, \r is carriage return, \n is newline
>>> int.to_bytes(0x0A0B0C0D,length=4,byteorder="little",signed=True)
b'\r\x0c\x0b\n'
# convert int to a unsigned long long (8 bytes)
>>> int.to_bytes(0x1E1F2021222324,length=8,byteorder="big",signed=False)
b'\x00\x1e\x1f !"#$'
# You don't really need to convert character strings of bytes.
>>> b'Hello'
b'Hello'
# What if you want to convert a float or double to binary?
# Then you have to use the more complicated struct module.
>>> import struct
>>> # write a single float (4 bytes) little endian
>>> struct.pack("<f", 0.25)
b'\x00\x00\x80>'
>>> # write a single double (8 bytes) big endian
>>> struct.pack(">d", 0.5)
b'?\xe0\x00\x00\x00\x00\x00\x00'
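Going the other way, struct.unpack() reads those bytes back into floats. A short sketch using the two byte strings produced above; note that unpack always returns a tuple, even for a single value.

```python
import struct

# struct.unpack returns a tuple, so unpack the single element with a trailing comma.
(quarter,) = struct.unpack("<f", b'\x00\x00\x80>')                   # little-endian float
(half,) = struct.unpack(">d", b'?\xe0\x00\x00\x00\x00\x00\x00')    # big-endian double

print(quarter, half)
print(struct.calcsize("<f"), struct.calcsize(">d"))  # sizes in bytes of each format
```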
Let’s read in some values. Below, I’m assuming that I know the stored format: big or little endian, and the size of the values (char, short, int, long, or long long).
# int.from_bytes(bytes, byteorder, signed=False)
>>> int.from_bytes(b'\x01','big',signed=False)
1
>>> int.from_bytes(b'\x02','big', signed=True)
2
>>> int.from_bytes(b'\x03\x04','big', signed=False)
772
>>> int.from_bytes(b'\x05\x03','little', signed=False)
773
>>> int.from_bytes(b'\xff\xf6','big', signed=True)
-10
>>> int.from_bytes(b'\x06\x07\x08\t','big', signed=False)
101124105
>>> int.from_bytes(b'\r\x0c\x0b\n','little', signed=True)
168496141
>>> int.from_bytes(b'\x00\x1e\x1f !"#$','big', signed=False)
8478472156619556
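To see why knowing the stored format matters, here is a small sketch that reads the same two bytes four different ways. Only one of the four interpretations recovers the -10 that was written earlier.

```python
raw = b'\xff\xf6'

big_signed = int.from_bytes(raw, 'big', signed=True)
big_unsigned = int.from_bytes(raw, 'big', signed=False)
little_signed = int.from_bytes(raw, 'little', signed=True)
little_unsigned = int.from_bytes(raw, 'little', signed=False)

print(big_signed, big_unsigned, little_signed, little_unsigned)
```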
We can save this data to a file, and read it in again. Just make sure you tell Python to write a binary file. What does this with statement do? It runs a final command (when the indented block ends) that is predefined for whatever type you give it. For opening a file, the predefined command is close().
# write to a file
>>> with open('data.bin','wb') as outfile:
...     data = b'\x06\x07\x08\t' + b'\r\x0c\x0b\n' + b'\x00\x1e\x1f !"#$'
...     outfile.write(data)
...
16
Here’s what that looks like on disk:
elnux3:~> hexdump -Cv data.bin
00000000 06 07 08 09 0d 0c 0b 0a 00 1e 1f 20 21 22 23 24 |........... !"#$|
00000010
Now let’s read it back in and parse.
# read from a file
>>> with open('data.bin','rb') as infile:
...     my_bytes = infile.read()
...
>>> value1 = int.from_bytes(my_bytes[0:4],'big', signed=False)
>>> value2 = int.from_bytes(my_bytes[4:8],'little', signed=True)
>>> value3 = int.from_bytes(my_bytes[8:16],'big', signed=False)
>>> print(value1, hex(value1))
101124105 0x6070809
>>> print(value2, hex(value2))
168496141 0xa0b0c0d
>>> print(value3, hex(value3))
8478472156619556 0x1e1f2021222324
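The same slice-and-convert approach works on the JPEG hexdump shown earlier. The sketch below assumes, as I understand the JPEG segment layout, that each marker after the first is followed by a two-byte big-endian length that counts itself; the variable names are my own.

```python
# The first 40 bytes of the hexdump shown earlier.
header = bytes.fromhex(
    "ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 48 "
    "00 48 00 00 ff e1 04 dc 45 78 69 66 00 00 4d 4d "
    "00 2a 00 00 00 08 00 09"
)

soi = header[0:2]                                   # start-of-image marker
app0_marker = header[2:4]
app0_length = int.from_bytes(header[4:6], "big")    # includes the 2 length bytes
app0_ident = header[6:10]

app1_offset = 4 + app0_length                       # first byte after the APP0 segment
app1_marker = header[app1_offset:app1_offset + 2]
app1_length = int.from_bytes(header[app1_offset + 2:app1_offset + 4], "big")
app1_ident = header[app1_offset + 4:app1_offset + 8]

print(soi.hex(), app0_marker.hex(), app0_length, app0_ident)
print(app1_offset, app1_marker.hex(), app1_length, app1_ident)
```

Notice that the length field is what lets us hop from one segment to the next without scanning byte by byte.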
Bit twiddling in Python
Great. Now we can store data in a file, and we know how to read it back in. But there is still another thing we need to learn: how to manipulate the bits within a byte or sequence of bytes. When we get to filesystems, you’ll be doing some bit twiddling – that is, manipulation of bytes at the bit level.
For those of you who aren’t fresh out of COMPSCI 230, here’s a quick recap on bit twiddling.
The fundamental operations you’ll need are “bitwise and” (in Python, the operator &), “bitwise or” (|), and bit shifting left (<<) and right (>>). Some more information here: https://docs.python.org/3.9/reference/expressions.html#binary-bitwise-operations.
For example, suppose I wanted to extract the middle four bits (out of one byte) of the number 222, and put them at the front of a byte that ends with four set bits.
# i'm using leading zeros so that you can follow more easily
>>> destination = 0b00001111
>>> source = 222
>>> f"{source:08b}" # This line is just to see the binary. Don't you go and actually manipulate strings!
'11011110'
# Create a mask that has the bits we want set (one), and all others cleared (zero)
>>> mask = 0b00111100
# you can alternatively write masks in hex if you want:
# 0b00111100 == 0x3c == 60
# use the bitwise AND
>>> bits = source & mask
>>> f"{bits:08b}"
'00011100'
# shift them into position; we want them in the front four, so we need to
# move them over two
>>> bits = bits << 2
>>> f"{bits:08b}"
'01110000'
# now put those bits into the destination with bitwise OR
>>> destination = destination | bits
>>> f"{destination:08b}"
'01111111'
Again, you should not be operating on the string representation of the bytes! In other words, don’t call bin() then manipulate the resulting value of type str! You should operate on the underlying byte (which will be of type int). It’s orders of magnitude more efficient and the only reasonable way to do things in real bit twiddling code. To convert a value to a string of zeros and ones and then work with the ASCII characters to allow for bit manipulation is just wrong on many levels. It’s important that you know how to do this for the assignments to come.
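These same operators offer one possible way to stop wasting a whole byte on each boolean in the LED lights example. The sketch below assumes a made-up layout (bit 1 is powered on/off, bit 0 is warm/daylight) and hypothetical helper names; it works on ints throughout and never touches a string of zeros and ones.

```python
def pack_flags(powered_on, warm):
    """Store two booleans in the low two bits of a single byte value."""
    return (int(powered_on) << 1) | int(warm)


def unpack_flags(byte):
    """Recover the two booleans with a shift and a one-bit mask."""
    powered_on = bool((byte >> 1) & 1)
    warm = bool(byte & 1)
    return powered_on, warm


flags = pack_flags(True, False)
print(f"{flags:08b}")       # formatting is only for viewing the bits
print(unpack_flags(flags))
```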