04: Strings

Estimated time to complete: three hours (or less, if you are experienced with Python and bit manipulations)

Working with UTF-encoded data is a fine way to learn some of the basics of manipulating bits and bytes. In this assignment, you'll write your own (simplified) UTF-8 encoder and decoder. Separately, you'll write a version of strings that can handle subsets of UTF-8 and UTF-16.

Manipulating bits

As we talked about in class, UTF-8 is an encoding from a Unicode code point to a sequence of one or more bytes.

You're going to write an encoder that, given an integer value representing a character's Unicode code point, outputs a "bytes" object representing that character as UTF-8. You're then going to write a decoder that, given a bytes object representing a UTF-8 encoded Unicode value, returns the numeric value of the corresponding codepoint.

You should not worry about carefully filtering valid/invalid characters, like the "invalid encodings" described in the Wikipedia article for UTF-8: just naively encode/decode the codepoints/bytes that are passed to your functions.

Python already has the tools to do this task, though I want you to do it manually using bitwise operations. In other words, your encoder and decoder will be equivalent to:

def encode(codepoint):
    return chr(codepoint).encode()

def decode(bytes_object):
    return ord(bytes_object.decode())

but you may not use str.encode() or bytes.decode(), directly or indirectly, in your solution. Indirect use includes, for example, calling bytes(chr(codepoint), 'utf-8'), which invokes str.encode() indirectly per its documentation.

What do you do instead? You follow the encoding scheme we showed in class and that's pretty well described in Wikipedia. "But, that requires I be able to move bits around!" Yes, yes it does.

The fundamental operations you'll need are "bitwise and" (in Python, the operator &), "bitwise or" (|), and bit shifting left (<<) and right (>>). (See, for example, https://docs.python.org/3/reference/expressions.html#binary-bitwise-operations).

For example, suppose I wanted to extract the middle four bits (out of one byte) of the number 222, and put them at the front of a byte that ends with four set bits.

destination = 0b1111
source = 222
print(bin(source)) # just to see it, don't actually manipulate strings!
# => 0b11011110

mask = 0b00111100 
# you can write masks (or any values, really!) in hex if you want: 
# 0b00111100 == 0x3c == 60

# extract the middle four bits; all others are set to zero
bits = source & mask
print(bin(bits))
# => 0b11100 # notice the leftmost zeros are elided

# shift them into position; we want them in the front four, so we need to
# move them over two
bits = bits << 2
print(bin(bits))
# => 0b1110000

# now put them into the destination:
destination = destination | bits
print(bin(destination))
# => 0b1111111

So you can build up each byte this way. How do you turn a list of bytes into a bytes object? Using the bytes constructor, for example:

l = [77, 97, 114, 99]
bytes_object = bytes(l)

Make sure you are actually working on the values themselves. Do not call bin() and then use string manipulation on the result.
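To make the connection to UTF-8 concrete, here is a small sketch that hand-assembles one code point, U+00E9, using the two-byte layout (110xxxxx 10xxxxxx) from the Wikipedia article. It covers only this one case and is not a general encoder; the variable names are just for illustration, and the other sequence lengths are left to you.

```python
# Hand-assemble the two-byte UTF-8 form of U+00E9 (illustration only).
codepoint = 0xE9  # 0b11101001

# Two-byte layout: 110xxxxx 10xxxxxx (5 high payload bits, then 6 low ones).
first = 0b11000000 | (codepoint >> 6)            # marker bits plus the top bits
second = 0b10000000 | (codepoint & 0b00111111)   # continuation marker plus low 6 bits

encoded = bytes([first, second])
print(encoded)
# => b'\xc3\xa9'

# Built-ins are fine for checking your work while testing.
assert encoded == chr(codepoint).encode()

# Going the other way: mask off the marker bits and recombine the payload.
decoded = ((encoded[0] & 0b00011111) << 6) | (encoded[1] & 0b00111111)
print(hex(decoded))
# => 0xe9
```

Notice that every step operates on integers; bin() and hex() appear only to show the values.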

Carving strings

strings generally extracts each sequence of n or more printable ASCII values from its input and prints it on a separate line; the exact behavior varies depending upon the version of strings you have. On the EdLab, strings is GNU strings.

GNU strings takes an optional argument -n min-len, indicating the value of n. It also takes an optional argument -e encoding, indicating the encoding that should be checked. s is essentially UTF-8 printable ASCII; b and l are essentially big- and little-endian UTF-16 printable ASCII.

You're going to implement strings in Python, treating the Unicode code points between U+20 and U+7E, inclusive, as printable. Just like in strings, each string of the required length (or greater) will be printed to standard out on its own line.

Note you do not need to use your own encoder and decoder from earlier in this assignment; you may use Python built-ins to determine if a given byte (or pair of bytes, in the case of UTF-16) represents a character of interest.
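As a rough sketch of what "a character of interest" can look like in code, the example below checks individual bytes and UTF-16 code units against the printable range. The helper name is_printable is made up for this illustration, and int.from_bytes is just one of several built-in ways to combine a pair of bytes.

```python
def is_printable(value):
    """Return True for code points U+20 through U+7E, inclusive."""
    return 0x20 <= value <= 0x7E

data = b'Hi\x00!'
# Iterating over a bytes object yields integers, one per byte.
flags = [is_printable(b) for b in data]
print(flags)
# => [True, True, False, True]

# In UTF-16, each printable ASCII character occupies two bytes,
# and the byte order depends on the endianness.
pair_le = 'A'.encode('utf-16-le')   # b'A\x00'
pair_be = 'A'.encode('utf-16-be')   # b'\x00A'
unit_le = int.from_bytes(pair_le, 'little')
unit_be = int.from_bytes(pair_be, 'big')
print(hex(unit_le), hex(unit_be))
# => 0x41 0x41
```

Both byte orders yield the same code unit once the pair is combined in the right order, so the same range check applies to all three encodings.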

590F

If you are enrolled in 590F, your strings.py must take an additional option -x. If -x is present on the command line, for example, python3 strings.py -x file.dat, your program must attempt to extract all printable Unicode in the assigned Basic Multilingual Plane, not just printable ASCII. That's actually a pain to specify exactly (lots of reserved unused codepoints in there), so for purposes of this assignment, this means all codepoints that are printable ASCII (U+0020--U+007E), and all codepoints in the range U+00A1--U+D7FF.
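The two ranges above translate directly into a predicate on code points. This is only a sketch, and the name is_printable_bmp is hypothetical; how you obtain the code points from the file is up to you.

```python
def is_printable_bmp(codepoint):
    """True for printable ASCII or the U+00A1 through U+D7FF range."""
    return 0x20 <= codepoint <= 0x7E or 0xA1 <= codepoint <= 0xD7FF

for cp in (0x41, 0x7F, 0xA0, 0xA1, 0xD7FF, 0xD800):
    print(hex(cp), is_printable_bmp(cp))
# => 0x41 True
# => 0x7f False
# => 0xa0 False
# => 0xa1 True
# => 0xd7ff True
# => 0xd800 False
```

The upper bound stops just short of U+D800, where the surrogate range begins.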

You will almost certainly want to redirect the output of this command to a file, rather than sending it right to your terminal, especially when you're debugging.

What not to do

Don't use the built-in str.encode and bytes.decode methods in your codec.py! You can use them when testing, of course, but part of the point of this assignment is to practice bit-level manipulations of data.
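One way to use the built-ins for testing is to compare your functions against them over a handful of code points. The sketch below is one possible harness, not a required one; check_codec and the reference_* stand-ins are hypothetical names, and the stand-ins exist only so the example runs by itself.

```python
def check_codec(encode, decode, codepoints):
    """Return the code points where encode/decode disagree with the built-ins."""
    failures = []
    for cp in codepoints:
        expected = chr(cp).encode()
        if encode(cp) != expected or decode(expected) != cp:
            failures.append(cp)
    return failures

# Stand-ins so the sketch runs on its own; when testing your own work,
# use `from codec import encode, decode` instead.
def reference_encode(codepoint):
    return chr(codepoint).encode()

def reference_decode(bytes_object):
    return ord(bytes_object.decode())

# Boundary values for each UTF-8 sequence length. Surrogates (U+D800
# through U+DFFF) are left out: chr(cp).encode() raises UnicodeEncodeError
# for them, so there is nothing to compare against.
samples = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
result = check_codec(reference_encode, reference_decode, samples)
print(result)
# => []
```

An empty list means every sampled code point round-tripped correctly. Keep a harness like this in a separate test file rather than in codec.py itself.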

Do not use bin() in your codec, except in debugging. You should work directly on the underlying values, and not use high-level string manipulations.

(It is of course fine to use chr to convert codepoints into Python strings for output in your strings.py.)

If you are in 590F, your strings.py may use str.encode and bytes.decode (and you will probably want to). If you are in 365, you won't need to use either, though you may if you so choose.

What to submit

(Note that a Gradescope item will go up later.)

Submit two Python files. The first, codec.py, should contain encode and decode functions as described above. It must not produce output when simply imported into a running instance of Python. That is, it should have behavior equivalent to:

def encode(codepoint):
    return chr(codepoint).encode()


def decode(bytes_object):
    return ord(bytes_object.decode())


def main():
    pass

if __name__ == '__main__':
    main()

The second, strings.py, should implement the behavior described above for strings. Argument parsing is tedious so we've done it for you in the following template, which you are free to use (590F students will need to add an argument to the parser):

import argparse


def print_strings(file_obj, encoding, min_len):
    print(file_obj.name)
    print(encoding)
    print(min_len)


def main():
    parser = argparse.ArgumentParser(description='Print the printable strings from a file.')
    parser.add_argument('filename')
    parser.add_argument('-n', metavar='min-len', type=int, default=4,
                        help='Print sequences of characters that are at least min-len characters long')
    parser.add_argument('-e', metavar='encoding', choices=('s', 'l', 'b'), default='s',
                        help='Select the character encoding of the strings that are to be found. ' +
                             'Possible values for encoding are: s = UTF-8, b = big-endian UTF-16, ' +
                             'l = little endian UTF-16.')
    args = parser.parse_args()

    with open(args.filename, 'rb') as f:
        print_strings(f, args.e, args.n)

if __name__ == '__main__':
    main()