02: hexdump

Estimated time to complete: two hours (or less, if you are experienced with Python)

A hex dump is a view of data in hexadecimal format. Producing one is akin to the “Hello World” of digital forensics. In this assignment, you’re going to implement a simple hex dump program in Python, using the hexdump program as a reference for how your output should be formatted.

Problem description

We’re going to do a clean-room implementation of the BSD hexdump utility, which is installed as /usr/bin/hexdump on the EdLab (and on OS X, and on most Linux distributions). In particular, you are going to write a Python program named hexdump.py that reproduces the effect of invoking hexdump -Cv filename, which writes a hexdump of the contents of the given filename in Canonical, verbose format to standard output. In other words, entering at the command line:

hexdump -Cv filename

and

python3.5 hexdump.py filename

should result in identical behavior for valid input files. What is that behavior? To quote the manual page (accessible by typing man hexdump at the command line), it should “display the input offset in hexadecimal, followed by sixteen space-separated, two column, hexadecimal bytes, followed by the same sixteen bytes in %_p format enclosed in '|' characters.”

For example, the output of hexdump on the sidebar JPEG file for this site is:

00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 00 00 01  |......JFIF......|
00000010  00 01 00 00 ff db 00 43  00 07 07 09 09 09 09 09  |.......C........|
00000020  09 09 09 09 09 09 09 09  09 09 09 09 09 09 09 09  |................|
...
(many omitted lines)
...
0002d690  12 93 b3 e7 8e 86 62 f5  0a 25 4e 4b f0 7a d0 24  |......b..%NK.z.$|
0002d6a0  df 58 22 1a 3b 37 c5 16  92 a3 c4 b1 68 f7 69 64  |.X".;7......h.id|
0002d6b0  bc 10 7c d1 0d 02 da 5f  07 ea c0 1f ff d9        |..|...._......|
0002d6be

Each full line represents 16 bytes. The first column is the offset (starting at zero; the second line starts at offset 0x00000010, 16 bytes into the file). The middle column is the byte values in hex; note there is an extra space between the eighth and ninth byte. The final column is the same bytes in so-called perusal format, enclosed in vertical bar (also known as pipe) characters. There are two spaces between each of the three columns, and no spaces after the final character of each line.

Perusal format means that bytes that represent printable ASCII characters are shown as those ASCII characters; all other bytes are replaced by a period ('.'). Which bytes are printable ASCII? Let me Google that for you, but in essence, any value that is between 0x20 and 0x7E, inclusive, is considered printable ASCII.
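As a minimal sketch of the rule above, a single byte value can be mapped to its perusal character as follows. The function name to_perusal is hypothetical (it is not part of the assignment), and the sample bytes are taken from the start of the example output.

```python
def to_perusal(byte_value):
    """Map one byte value (an int from 0 to 255) to its perusal-format character."""
    # 0x20 (space) through 0x7E (tilde), inclusive, are printable ASCII.
    if 0x20 <= byte_value <= 0x7E:
        return chr(byte_value)
    # Every other value is shown as a period.
    return '.'


# The bytes for "JFIF" followed by 0x00 and 0x01, as in the example output.
sample = bytes([0x4a, 0x46, 0x49, 0x46, 0x00, 0x01])
print(''.join(to_perusal(b) for b in sample))  # prints: JFIF..
```

Iterating over a bytes object yields integers, which is why the comparison against 0x20 and 0x7E works directly.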

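On the last line of output that contains data, the closing pipe immediately follows the last character, even when that line holds fewer than 16 bytes (the hex column is still padded with spaces to its full width, as in the example above). The final line of output contains only an offset in its first column and no other information. That offset is the total number of bytes in the file, that is, one past the offset of the last byte. The only exception is if the input file is empty (zero bytes long), in which case the output of hexdump is empty as well.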
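To check the arithmetic against the example output: the last data row starts at offset 0x2d6b0 and holds 14 bytes, so the trailing line shows 0x2d6b0 + 14 = 0x2d6be, which is the file's total length. The sketch below does the same calculation for any length; the function name describe_tail is hypothetical.

```python
BYTES_PER_LINE = 16


def describe_tail(length):
    """Return (last row offset, bytes on last row, trailing offset) for a file length.

    Offsets are returned as eight-digit hex strings. Returns None for an
    empty file, because hexdump prints nothing at all in that case.
    """
    if length == 0:
        return None
    # Offset of the row that contains the final byte (at index length - 1).
    last_row_offset = ((length - 1) // BYTES_PER_LINE) * BYTES_PER_LINE
    bytes_on_last_row = length - last_row_offset
    return ('{:08x}'.format(last_row_offset),
            bytes_on_last_row,
            '{:08x}'.format(length))


print(describe_tail(0x2d6be))  # prints: ('0002d6b0', 14, '0002d6be')
print(describe_tail(32))       # prints: ('00000010', 16, '00000020')
print(describe_tail(0))        # prints: None
```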

For any readable file, your program must produce the correct output and exit without error. Your program need not handle exceptional cases related to a missing filename argument or an un-open-able file; do not bother catching the exceptions.

What to test

Make sure your program produces the correct output for at least the following:

  • empty inputs
  • inputs with fewer than 16 bytes (that is, less than one full line of output)
  • inputs with an exact multiple of 16 bytes
  • inputs with more than one line of output
  • inputs with any/all of the printable ASCII characters
  • inputs with any/all of the non-printable ASCII characters
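
One way to produce inputs for the cases above is to write them from Python. This is only a sketch: the file names and contents are made up for illustration, and the files are written to a temporary directory, so change the directory if you want to keep them.

```python
import os
import tempfile

# Hypothetical file names and contents, roughly one per case in the list above.
TEST_CASES = {
    'empty.dat': b'',                           # empty input
    'short.dat': b'hello',                      # fewer than 16 bytes
    'exact.dat': bytes(range(32)),              # exact multiple of 16 bytes
    'multi.dat': b'x' * 40,                     # several lines, partial last line
    'printable.dat': bytes(range(0x20, 0x7F)),  # every printable ASCII byte
    'all_bytes.dat': bytes(range(256)),         # every byte value, printable or not
}


def write_test_files(directory):
    """Write each test case into directory and return the paths created."""
    paths = []
    for name, contents in sorted(TEST_CASES.items()):
        path = os.path.join(directory, name)
        with open(path, 'wb') as f:
            f.write(contents)
        paths.append(path)
    return paths


# A temporary directory keeps the example from cluttering the working directory.
with tempfile.TemporaryDirectory() as tmp:
    for path in write_test_files(tmp):
        print(os.path.basename(path), os.path.getsize(path))
```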

You can use hexdump to validate your program on test files. For example, suppose you have a file named test1.dat, and you want to make sure your program produces the correct output. Use hexdump and the shell redirect operator ('>') to save its output to a file named test1.dat.expected:

hexdump -Cv test1.dat > test1.dat.expected

Now run your program and save its output to a file test1.dat.output:

python3.5 hexdump.py test1.dat > test1.dat.output

You can check that the two files are identical by eye, or by using the diff utility:

diff test1.dat.expected test1.dat.output

If they are the same, diff will have no output; if they differ, diff will show you the difference(s). If they differ only in whitespace it will be hard to see the difference: you may need to highlight diff’s output (or copy/paste into an editor) to see the problem. You can pass various arguments to diff to get different output formats, depending upon which version of diff you have installed.
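
If whitespace differences are hard to spot, another option is to compare the two outputs in Python and print the first mismatching pair of lines. Printing a tuple shows each string with quotes around it, which makes trailing spaces visible. This is a sketch only; first_difference is a hypothetical helper and the sample strings are placeholders, not real hexdump output.

```python
def first_difference(expected_text, actual_text):
    """Return (line_number, expected_line, actual_line) for the first mismatch.

    Returns None when the two texts are identical. A line that is missing
    from one side is reported as None.
    """
    expected_lines = expected_text.split('\n')
    actual_lines = actual_text.split('\n')
    # Pad the shorter list so that a missing line shows up as None.
    length = max(len(expected_lines), len(actual_lines))
    expected_lines += [None] * (length - len(expected_lines))
    actual_lines += [None] * (length - len(actual_lines))
    pairs = zip(expected_lines, actual_lines)
    for number, (expected_line, actual_line) in enumerate(pairs, start=1):
        if expected_line != actual_line:
            return number, expected_line, actual_line
    return None


# Placeholder texts; the second one has a stray trailing space on line 2.
expected = 'first line\nsecond line\n'
actual = 'first line\nsecond line \n'
print(first_difference(expected, actual))
# prints: (2, 'second line', 'second line ')
```

To use it on the files from the example, read test1.dat.expected and test1.dat.output in text mode and pass their contents to the function.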

What to submit

Upload your code to Gradescope when you’re ready to submit it. Your submission should be a single file named hexdump.py, implementing the required behavior in Python 3.5. You can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.

What not to do

Don’t try to be cute by invoking /usr/bin/hexdump using Python’s subprocess library or the like. Implement the required behavior in Python yourself.

Suggestions / hints

My solution to this assignment is just under fifty lines long. If you find your solution getting to be much more than double or triple that length, stop and reassess your approach.

In general, follow good programming practice: break things up into meaningful, reusable, independently-testable functions; use meaningful variable and constant names; try to follow language conventions (such as PEP-8); etc.

Don’t be afraid to ask for help on Piazza. Almost certainly a helpful classmate or the course staff will be able to get you unstuck if you’re having difficulty. Of course, you’ll need to ask early enough that someone sees your question and has time to answer before the due date.

If you are new to Python, you may find some of the following suggestions and information helpful.

Use the REPL (the Read Eval Print Loop, that is, the interactive Python interpreter) to quickly test that functions and methods do what you expect.

To get a command-line argument, you read from the list stored in sys.argv. The first argument, which will be the filename, will be stored in sys.argv[1]. Note that sys is only in scope if you import sys at the top of your program.
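
A small sketch of the indexing involved: sys.argv[0] is the script name, so the filename is at index 1. The helper name get_filename is hypothetical, and a literal list stands in for sys.argv so the example runs without any command-line arguments.

```python
import sys


def get_filename(argv):
    """Return the first command-line argument from an argv-style list."""
    # argv[0] is the name of the script; argv[1] is the first real argument.
    return argv[1]


# In your program you would call get_filename(sys.argv).
# This list mimics what sys.argv holds for: python3.5 hexdump.py test1.dat
example_argv = ['hexdump.py', 'test1.dat']
print(get_filename(example_argv))  # prints: test1.dat
```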

To open a file as a binary file rather than in text mode, you must pass an argument to open. In particular, use something like open(filename, 'rb') to open the file in binary mode. The idiomatic way to perform IO on a file in Python is something like:

with open(filename, 'rb') as f:
    x = do_something(f)
do_something_else(x)

Above, f is only valid within the with statement’s body, and the file is automatically closed afterward.

open returns a file-like object that you can call read on. When a file is opened in binary mode, read returns a bytes object (a sequence of bytes). It’s fine to read the entire file into memory, or to read it some number of bytes at a time, whichever works better for you. The latter uses less memory on large files, and because reads are buffered by default it is not noticeably slower, but the former might make your program simpler.
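
The two reading strategies might look like the sketch below. The function names read_all and read_chunks are hypothetical, and the demonstration writes its own 40-byte temporary file so that it can run anywhere.

```python
import os
import tempfile


def read_all(path):
    """Read the whole file into memory as a single bytes object."""
    with open(path, 'rb') as f:
        return f.read()


def read_chunks(path, chunk_size=16):
    """Yield successive chunks of at most chunk_size bytes."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # read returns b'' at end of file
                break
            yield chunk


# Demonstrate on a temporary 40-byte file (the contents are arbitrary).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.dat')
    with open(sample_path, 'wb') as f:
        f.write(bytes(range(40)))
    whole = read_all(sample_path)
    chunk_lengths = [len(chunk) for chunk in read_chunks(sample_path)]

print(len(whole))     # prints: 40
print(chunk_lengths)  # prints: [16, 16, 8]
```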

Byte sequences can be sliced and iterated over just like any other sequence (for example, like lists). For example, if data is a sequence of bytes, data[0:8] returns a sequence of eight bytes, the zeroth (inclusive) through the eighth (exclusive).
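
A short sketch of slicing and iteration on a bytes object, using made-up sample data. Note that slicing gives back bytes, while indexing or iterating gives integers.

```python
data = b'JFIF\x00\x01'

# Slicing returns another bytes object.
first_four = data[0:4]
print(first_four)  # prints: b'JFIF'

# Indexing a single position, or iterating, yields integers.
print(data[0])        # prints: 74
print(list(data[:2]))  # prints: [74, 70]

# A slice that runs past the end is simply shorter; it is not an error.
print(len(data[0:16]))  # prints: 6

# Stepping through the data in fixed-size rows (4 here, to keep the output short).
rows = [data[i:i + 4] for i in range(0, len(data), 4)]
print(rows)  # prints: [b'JFIF', b'\x00\x01']
```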

The print function writes to standard output by default. It terminates with a newline by default; pass a value to the end parameter to change that behavior. For example print('hello') will print the string “hello” to standard output and append a newline, but print('hello', end='') will print the string “hello” to standard output and not append a newline.

While you can use hex to format a byte (or an integer, like, say, the offset) as hex, you’ll run into trouble with left-padding of zeros. You can either pad the result yourself, or use a “format string” to format values the way you want them. String formatting means using the format method on a string to create other strings. The explanation of the format string mini-language is pretty dry, but the examples make it a little more clear. For our purposes, the important values for the format_spec are going to be fill, width, and type. We will generally be filling with zeros, having a width of either eight (for the offset) or two (for the hex-encoded middle column values), and a type of hex. To format an integer appropriately for the left column, you might use something like '{:08x}'.format(value), that is, a string “{:08x}”, upon which you call the format method, with an argument of the value you want to format. This will return an eight-character wide string, left-padded with zeros, consisting of the passed value formatted as hexadecimal. In your code, this might show up as:

print('{:08x}'.format(offset), end='  ')

if offset was the offset value you were printing and you wanted to include the two spaces between it and the next column.
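
To see the difference between hex and a format string side by side, here is a small sketch using an offset and a byte value taken from the example output:

```python
offset = 0x2d6b0
byte_value = 0x0a

# hex() adds a '0x' prefix and does no padding.
print(hex(offset))      # prints: 0x2d6b0
print(hex(byte_value))  # prints: 0xa

# A format spec of 08x or 02x zero-fills to a fixed width, with no prefix.
offset_text = '{:08x}'.format(offset)
byte_text = '{:02x}'.format(byte_value)
print(offset_text)  # prints: 0002d6b0
print(byte_text)    # prints: 0a
```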

You can use chr on a byte in the printable ASCII range to convert it to a string.

If you want to concatenate a list of strings, you can use the join method. The join method is called on the string that you want interposed between the strings in your list. For example, in:

x = ['this', 'is', 'a', 'test']
s = ' '.join(x)
print(s)

s is created by joining together all of the strings in x, with a space between each. The output is “this is a test”. You can use join on an empty string (''.join(a_list_of_strings)) to concatenate strings directly (that is, with nothing between them).

You can use the * operator on a string and an integer to create a string consisting of that many repetitions of the string. For example, 'a' * 10 evaluates to the string 'aaaaaaaaaa'. This fact might be helpful when padding strings with either zeros or spaces.
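
Both kinds of padding can be sketched with string repetition as follows. The width of 10 is an arbitrary value chosen for illustration; it is not the width hexdump uses.

```python
# Left-pad a short hex string with zeros to a width of eight.
short_hex = '2d6be'
zero_padded = '0' * (8 - len(short_hex)) + short_hex
print(zero_padded)  # prints: 0002d6be

# Right-pad a string with spaces to a fixed width.
WIDTH = 10
text = 'ff d9'
space_padded = text + ' ' * (WIDTH - len(text))
print(repr(space_padded))  # prints: 'ff d9     '

# A count of zero or less yields the empty string, so a string that is
# already wide enough is left unchanged by this kind of padding.
print(repr('x' * 0))   # prints: ''
print(repr('x' * -3))  # prints: ''
```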