04: Strings

Estimated time to complete: three hours (or less, if you are experienced with Python and bit manipulations)

Working with UTF-encoded data is a fine way to learn some of the basics of manipulating bits and bytes. In this assignment, you’ll write your own (simplified) UTF-8 encoder and decoder. Separately, you’ll write a version of strings that can handle subsets of UTF-8 and UTF-16.

Manipulating bits

As we talked about in class, UTF-8 is an encoding from a Unicode code point to a sequence of one or more bytes.

You’re going to write an encoder that, given an integer value representing a character’s Unicode code point, outputs a bytes object representing that character as UTF-8. You’re then going to write a decoder that, given a bytes object representing a UTF-8 encoded Unicode value, returns the numeric value of the corresponding code point.

You should not worry about carefully filtering valid/invalid characters, like the “invalid encodings” described in the Wikipedia article for UTF-8: Just (naively) encode/decode the codepoints/bytes that are passed to your functions.

Python already has the tools to do this task, though I want you to do it manually using bitwise operations. In other words, your encoder and decoder will be equivalent to:

def encode(codepoint):
    # encode takes a codepoint (an `int`) and returns a `bytes` object
    return chr(codepoint).encode()

def decode(bytes_object):
    # decode takes a `bytes` object and returns a codepoint (an `int`)
    return ord(bytes_object.decode())

but you may not use str.encode() or bytes.decode(), directly or indirectly, in your solution. Indirect use includes, for example, calling bytes(chr(codepoint), 'utf-8'), which invokes str.encode() indirectly per its documentation.

What do you do instead? You follow the encoding scheme we showed in class and that’s pretty well described in Wikipedia. “But, that requires I be able to move bits around!” Yes, yes it does.

The fundamental operations you’ll need are “bitwise and” (in Python, the operator &), “bitwise or” (|), and bit shifting left (<<) and right (>>). (See, for example, https://docs.python.org/3/reference/expressions.html#binary-bitwise-operations).

For example, suppose I wanted to extract the middle four bits (out of one byte) of the number 222, and put them at the front of a byte that ends with four set bits.

destination = 0b1111
source = 222
print(bin(source)) # just to see it, don't actually manipulate strings!
# => 0b11011110

mask = 0b00111100 
# you can write masks (or any values, really!) in hex if you want: 
# 0b00111100 == 0x3c == 60

# extract the middle four bits; all others are set to zero
bits = source & mask
print(bin(bits))
# => 0b11100 # notice the leftmost zeros are elided

# shift them into position; we want them in the front four, so we need to
# move them over two
bits = bits << 2
print(bin(bits))
# => 0b1110000

# now put them into the destination:
destination = destination | bits
print(bin(destination))
# => 0b1111111

So you can build up each byte this way. How do you turn a list of bytes into a bytes object? Using the bytes constructor, for example:

l = [77, 97, 114, 99]
bytes_object = bytes(l)
# now bytes_object is a bytes object with the bytes corresponding to the values in l

I’m so nice (ha-ha) that I’ll even give you part of the solution. Here’s how you might encode codepoints in the ASCII range:

def encode(codepoint):
    if codepoint < 128:
        return bytes([codepoint])

and here’s how you might decode them:

def decode(bytes_object):
    if len(bytes_object) == 1:
        return bytes_object[0]

Of course, UTF-8 encodings that result in multiple bytes are going to require some bit twiddling to create the list of bytes you convert into a bytes object, or to extract the bits from the multiple bytes.
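
As a worked illustration of that bit twiddling, here is a sketch that encodes and decodes one specific two-byte character, U+00E9, by hand. It assumes the two-byte layout 110xxxxx 10xxxxxx from the Wikipedia description; the variable names are only for illustration, and a full solution still has to choose the right layout for each range of code points.

```python
codepoint = 0xE9  # U+00E9, binary 11101001: too large for one UTF-8 byte

# Two-byte layout (assumed from the Wikipedia table): 110xxxxx 10xxxxxx
# The first byte carries the high bits, the second the low six bits.
first = 0b11000000 | (codepoint >> 6)
second = 0b10000000 | (codepoint & 0b00111111)
encoded = bytes([first, second])
print(encoded)
# => b'\xc3\xa9'

# Decoding reverses the steps: mask off the marker bits, then recombine.
high = encoded[0] & 0b00011111
low = encoded[1] & 0b00111111
decoded = (high << 6) | low
print(decoded)
# => 233
```

While testing (but not in codec.py itself), you can compare results like these against chr(codepoint).encode().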

Make sure you are actually working on the values themselves. Do not call bin() and use string manipulation on the strings! Work on values of type int.

Carving strings

strings generally extracts each sequence of n or more printable ASCII values from its input and prints each one on a separate line; the exact behavior varies depending upon the version of strings you have. On the EdLab, strings is GNU strings.

GNU strings takes an optional argument -n min-len, indicating the value of n. It also takes an optional argument -e encoding, indicating the encoding that should be checked. An encoding of s is essentially UTF-8 printable ASCII; b and l are essentially big- and little-endian UTF-16 printable ASCII.

You’re going to implement strings in Python, treating the Unicode code points between U+20 and U+7E, inclusive, as printable. Just like in strings, each string of the required length (or greater) will be printed to standard out on its own line.
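
To make "each string of the required length (or greater)" concrete, here is a minimal sketch that collects runs of printable code points from an in-memory sequence. The names is_printable and find_runs are hypothetical, and the sketch deliberately ignores file handling and the -e encodings; it is one possible approach, not the required structure.

```python
def is_printable(codepoint):
    # Printable range from the assignment text: U+20 through U+7E, inclusive.
    return 0x20 <= codepoint <= 0x7E


def find_runs(codepoints, min_len):
    """Return each run of min_len or more printable code points as a str."""
    runs = []
    current = []
    for codepoint in codepoints:
        if is_printable(codepoint):
            current.append(chr(codepoint))
        else:
            # A non-printable value ends the current run.
            if len(current) >= min_len:
                runs.append(''.join(current))
            current = []
    # Don't forget a run that reaches the end of the input.
    if len(current) >= min_len:
        runs.append(''.join(current))
    return runs


# Iterating over a bytes object yields ints, one per byte.
sample = b'\x00\x01abc\x00hello world\xff\x7fok'
for run in find_runs(sample, 4):
    print(run)
# => hello world
```

Here "abc" and "ok" are dropped because they are shorter than the minimum length of 4.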

For simplicity’s sake: Assume all UTF-16 strings are even-byte aligned. That is, you should assume they start on offsets (from the start of the file) that are divisible by two. In practice this is not the case, but allowing odd- and even-aligned strings leads to ambiguities that make autograding a hassle. Be aware that a real strings implementation needs to scan byte-by-byte, but for this assignment, yours does not.
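
Under that even-alignment assumption, the input can be walked two bytes at a time. The sketch below shows one way to do this with int.from_bytes; the function name utf16_code_units is hypothetical, and the choice to ignore a trailing odd byte is an assumption of this sketch.

```python
def utf16_code_units(data, byteorder):
    """Yield (offset, code unit) for each even-aligned pair of bytes.

    byteorder is 'big' or 'little', as accepted by int.from_bytes.
    A trailing odd byte, if any, is ignored.
    """
    for offset in range(0, len(data) - 1, 2):
        yield offset, int.from_bytes(data[offset:offset + 2], byteorder)


# "Hi" as big-endian and little-endian UTF-16 (no byte order mark).
print(list(utf16_code_units(b'\x00H\x00i', 'big')))
# => [(0, 72), (2, 105)]
print(list(utf16_code_units(b'H\x00i\x00', 'little')))
# => [(0, 72), (2, 105)]
```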

Note you do not need to use your own encoder and decoder from earlier in this assignment; you may use Python built-ins to determine if a given byte (or pair of bytes, in the case of UTF-16) represents a character of interest.

590F

If you are enrolled in 590F, your strings.py must take an additional option -x. If -x is present on the command line, for example, python3 strings.py -x file.dat, your program must attempt to extract all printable Unicode in the assigned Basic Multilingual Plane, not just printable ASCII. That’s actually a pain to specify exactly (lots of reserved unused codepoints in there), so for purposes of this assignment, this means all codepoints that are printable ASCII (U+0020–U+007E), and all codepoints in the range U+00A1–U+D7FF.
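
The two ranges can be written as a single test. This is only a sketch; the function name is_printable_extended is hypothetical.

```python
def is_printable_extended(codepoint):
    # Printable ASCII (U+0020 through U+007E) or U+00A1 through U+D7FF,
    # both inclusive, per the 590F definition in this assignment.
    return 0x20 <= codepoint <= 0x7E or 0xA1 <= codepoint <= 0xD7FF


print(is_printable_extended(0x41))    # 'A'
# => True
print(is_printable_extended(0xA0))    # falls in the gap between the ranges
# => False
print(is_printable_extended(0xD800))  # just past the upper bound
# => False
```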

You will almost certainly want to redirect the output of this command to a file, rather than sending it right to your terminal, especially when you’re debugging.

What not to do

Don’t use the built-in str.encode and bytes.decode methods in your codec.py! You can use them when testing, of course, but part of the point of this assignment is to practice bit-level manipulations of data.

Do not use bin() in your codec, except in debugging. You should work directly on the underlying values and bits, and not use high-level string manipulations.

It is fine to use chr to convert codepoints (integer values) into Python strings for output in your strings.py.

And, your strings.py (but not your codec.py!) may use the built-in str.encode and bytes.decode if you so desire.

What to submit

(Several Gradescope items, one for codec.py and one each for 365 / 590F’s strings.py, will go up later.)

Submit two Python files. The first, codec.py, should contain encode and decode functions as described above. It must not produce output or have other side effects when simply imported into a running instance of Python. That is, it should have behavior equivalent to:

def encode(codepoint):
    return chr(codepoint).encode()


def decode(bytes_object):
    return ord(bytes_object.decode())


def main():
    pass

if __name__ == '__main__':
    main()

but it will be longer, since it will use your bit-twiddling implementation rather than the one-liners above.

The second, strings.py, should implement the behavior described above for strings. Argument parsing is tedious, so we’ve done it for you in the following template, which you are free to use (590F students will need to add an argument to the parser to finish the assignment):

import argparse


def print_strings(file_obj, encoding, min_len):
    # Right now all this function does is print its arguments.
    # You'll need to replace that code with code that actually finds and prints the strings!
    print(file_obj.name)
    print(encoding)
    print(min_len)


def main():
    parser = argparse.ArgumentParser(description='Print the printable strings from a file.')
    parser.add_argument('filename')
    parser.add_argument('-n', metavar='min-len', type=int, default=4,
                        help='Print sequences of characters that are at least min-len characters long')
    parser.add_argument('-e', metavar='encoding', choices=('s', 'l', 'b'), default='s',
                        help='Select the character encoding of the strings that are to be found. ' +
                             'Possible values for encoding are: s = UTF-8, b = big-endian UTF-16, ' +
                             'l = little endian UTF-16.')
    args = parser.parse_args()

    with open(args.filename, 'rb') as f:
        print_strings(f, args.e, args.n)

if __name__ == '__main__':
    main()
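
For 590F students, here is a minimal sketch of how a -x flag could be added using argparse's store_true action. The build_parser helper and the help text are assumptions made for this illustration and are not part of the template; in the template you would add the same add_argument call inside main() and pass args.x along to print_strings.

```python
import argparse


def build_parser():
    # Hypothetical helper: a trimmed-down parser showing only the new flag.
    parser = argparse.ArgumentParser(description='Print the printable strings from a file.')
    parser.add_argument('filename')
    # store_true makes args.x True when -x is given and False otherwise.
    parser.add_argument('-x', action='store_true',
                        help='(assumed wording) also extract printable non-ASCII codepoints')
    return parser


args = build_parser().parse_args(['-x', 'file.dat'])
print(args.x)
# => True
```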