Estimated time to complete: three hours (or less, if you are experienced with Python and bit manipulations)
Working with UTF-encoded data is a fine way to learn some of the basics of manipulating bits and bytes. In this assignment, you’ll write your own (simplified) UTF-8 encoder and decoder. Separately, you’ll write a version of
strings that can handle subsets of UTF-8 and UTF-16.
As we talked about in class, UTF-8 is an encoding from a Unicode code point to a sequence of one or more bytes.
You’re going to write an encoder that, given an integer value representing a character’s Unicode code point, outputs a “bytes” object representing that character as UTF-8. You’re then going to write a decoder that, given a bytes object representing a UTF-8 encoded Unicode value, returns the numeric value of the corresponding code point.
You should not worry about carefully filtering valid/invalid characters, like the “invalid encodings” described in the Wikipedia article for UTF-8: Just (naively) encode/decode the codepoints/bytes that are passed to your functions.
Python already has the tools to do this task, though I want you to do it manually using bitwise operations. In other words, your encoder and decoder will be equivalent to:
    def encode(codepoint):
        # encode takes a codepoint (an `int`) and returns a `bytes` object
        return chr(codepoint).encode()

    def decode(bytes_object):
        # decode takes a `bytes` object and returns a codepoint (an `int`)
        return ord(bytes_object.decode())
but you may not use str.encode() or bytes.decode(), directly or indirectly, in your solution. Indirect use includes, for example, calling bytes(chr(codepoint), 'utf-8'), which invokes str.encode() indirectly per its documentation.
What do you do instead? You follow the encoding scheme we showed in class and that’s pretty well described in Wikipedia. “But, that requires I be able to move bits around!” Yes, yes it does.
The fundamental operations you’ll need are “bitwise and” (in Python, the operator &), “bitwise or” (|), and bit shifting left (<<) and right (>>). (See, for example, https://docs.python.org/3/reference/expressions.html#binary-bitwise-operations.)
For example, suppose I wanted to extract the middle four bits (out of one byte) of the number 222, and put them at the front of a byte that ends with four set bits.
    destination = 0b1111
    source = 222
    print(bin(222))  # just to see it, don't actually manipulate strings!
    # => 0b11011110

    mask = 0b00111100
    # you can write masks (or any values, really!) in hex if you want:
    # 0b00111100 == 0x3c == 60

    # extract the middle four bits; all others are set to zero
    bits = source & mask
    print(bin(bits))
    # => 0b11100
    # notice the leftmost zeros are elided

    # shift them into position; we want them in the front four, so we need to
    # move them over two
    bits = bits << 2
    print(bin(bits))
    # => 0b1110000

    # now put them into the destination:
    destination = destination | bits
    print(bin(destination))
    # => 0b1111111
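The same mask, shift, and OR steps carry over to UTF-8 itself. As a rough sketch only (not a general encoder), here is the two-byte layout worked by hand for the single code point U+00E9. The variable names are made up for illustration, and the 110xxxxx 10xxxxxx layout is the one given in the Wikipedia article.

```python
codepoint = 0xE9  # U+00E9, chosen only as an illustration

# the low six bits go into the continuation byte, tagged with the 10xxxxxx marker
low_six = codepoint & 0b00111111
second_byte = 0b10000000 | low_six

# the remaining high bits go into the leading byte, tagged with the 110xxxxx marker
high_five = (codepoint >> 6) & 0b00011111
first_byte = 0b11000000 | high_five

print(hex(first_byte), hex(second_byte))
# => 0xc3 0xa9
```

The three- and four-byte cases use the same moves with different masks and shift amounts.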
So you can build up each byte this way. How do you turn a list of bytes into a bytes object? Using the bytes constructor, for example:
    l = [77, 97, 114, 99]
    bytes_object = bytes(l)
    # now bytes_object is a bytes object with the bytes corresponding to the values in l
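One detail that is easy to miss when going the other direction: indexing a bytes object gives you an int, while slicing gives you another bytes object. A short illustration (the variable name is just for this example):

```python
data = bytes([77, 97, 114, 99])

print(data)        # => b'Marc'
print(data[0])     # => 77       (indexing yields an int)
print(data[0:1])   # => b'M'     (slicing yields a bytes object)
print(list(data))  # => [77, 97, 114, 99]
print(len(data))   # => 4
```

So the individual byte values are already available as integers, ready for masking and shifting.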
I’m so nice (ha-ha) that I’ll even give you part of the solution. Here’s how you might encode codepoints in the ASCII range:
    def encode(codepoint):
        if codepoint < 128:
            return bytes([codepoint])
and here’s how you might decode them:
    def decode(bytes_object):
        if len(bytes_object) == 1:
            # indexing a bytes object yields an int, which is what decode must return
            return bytes_object[0]
Of course, UTF-8 encodings that result in multiple bytes are going to require some bit twiddling to create the list of bytes you convert into a bytes object, or to extract the bits from the multiple bytes.
Make sure you are actually working on the values themselves. Do not call bin() and use string manipulation on the resulting strings! Work on the numeric values directly.
strings generally extracts and prints, each on a separate line, every sequence of n or more printable ASCII values from its input; the exact behavior varies depending upon the version of
strings you have. On the EdLab, strings is GNU strings. It takes an optional argument -n min-len, indicating the value of n. It also takes an optional argument -e encoding, indicating the encoding that should be checked. s is essentially UTF-8 printable ASCII; b and l are essentially big- and little-endian UTF-16 printable ASCII.
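To see what those three encodings look like at the byte level, it can help to encode a short ASCII string with Python's built-ins (fine here, since this is not the codec). The string 'Hi' is just an example:

```python
text = 'Hi'

print(text.encode('utf-8'))      # => b'Hi'
print(text.encode('utf-16-be'))  # => b'\x00H\x00i'
print(text.encode('utf-16-le'))  # => b'H\x00i\x00'
```

For printable ASCII, each UTF-16 code unit is the ASCII byte paired with a zero byte; only the order of the pair differs between big- and little-endian.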
You’re going to implement
strings in Python, treating the Unicode code points between U+20 and U+7E, inclusive, as printable. Just like in
strings, each string of the required length (or greater) will be printed to standard out on its own line.
For simplicity’s sake: Assume all UTF-16 strings are even-byte aligned. That is, you should assume they start on offsets (from the start of the file) that are divisible by two. In practice this is not the case, but allowing odd- and even-aligned strings leads to ambiguities that make autograding a hassle. Be aware that a real
strings implementation needs to scan byte-by-byte, but for this assignment, yours does not.
Note you do not need to use your own encoder and decoder from earlier in this assignment; you may use Python built-ins to determine if a given byte (or pair of bytes, in the case of UTF-16) represents a character of interest.
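One possible way to approach this, sketched under the assumptions above (even-aligned UTF-16, printable range U+20 through U+7E), is to first turn the raw bytes into a list of integer values and then test each one. The helper names is_printable and code_units are hypothetical, not part of the template, and grouping the printable values into runs of the required length is left out.

```python
def is_printable(value):
    # printable range used in this assignment: U+20 through U+7E, inclusive
    return 0x20 <= value <= 0x7E


def code_units(data, encoding):
    """Return the integer values in data, read as 's', 'l', or 'b'."""
    if encoding == 's':
        return list(data)  # one byte per value
    byteorder = 'little' if encoding == 'l' else 'big'
    # step two bytes at a time from offset 0; a trailing odd byte is ignored
    return [int.from_bytes(data[i:i + 2], byteorder)
            for i in range(0, len(data) - 1, 2)]


print(code_units(b'H\x00i\x00', 'l'))  # => [72, 105]
print(code_units(b'\x00H\x00i', 'b'))  # => [72, 105]
```

int.from_bytes is a built-in that combines the two bytes in the given byte order, which the note above permits in strings.py.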
What not to do
Don’t use the built-in str.encode or bytes.decode methods in your codec.py! You can use them when testing, of course, but part of the point of this assignment is to practice bit-level manipulations of data.
Do not use
bin() in your codec, except in debugging. You should work directly on the underlying values and bits, and not use high-level string manipulations.
It is fine to use chr to convert codepoints (integer values) into Python strings for output in your strings.py. Your strings.py (but not your codec.py!) may use the built-in bytes.decode if you so desire.
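Since the built-ins are allowed when testing, one way to check a codec is to compare it against them from a separate test script. The following is a minimal sketch; reference_encode and check_codec are hypothetical helper names, and this code belongs in a test file, not in codec.py.

```python
def reference_encode(codepoint):
    # built-in reference, for testing only (not allowed inside codec.py)
    return chr(codepoint).encode()


def check_codec(encode, decode, codepoints):
    """Return the code points for which encode or decode disagrees with Python."""
    failures = []
    for cp in codepoints:
        expected = reference_encode(cp)
        if encode(cp) != expected or decode(expected) != cp:
            failures.append(cp)
    return failures
```

For example, after `import codec`, calling `check_codec(codec.encode, codec.decode, [0x41, 0xE9, 0x20AC, 0x1F600])` exercises the one-, two-, three-, and four-byte cases and should return an empty list. Avoid code points in the surrogate range U+D800 through U+DFFF here, since the built-in encoder raises an error for them.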
What to submit
(Two Gradescope items, one for codec.py and one for strings.py, will go up later.)
Submit two Python files. The first, codec.py, should contain encode and decode functions as described above. It must not produce output or have other side effects when simply
imported into a running instance of Python. That is, it should have behavior equivalent to:
    def encode(codepoint):
        return chr(codepoint).encode()

    def decode(bytes_object):
        return ord(bytes_object.decode())

    def main():
        pass

    if __name__ == '__main__':
        main()
but it will be longer, since it will use your bit-twiddling implementation rather than the one-liners above.
The second, strings.py, should implement the behavior described above for strings. Argument parsing is tedious, so we’ve done it for you in the following template, which you are free to use:
    import argparse

    def print_strings(file_obj, encoding, min_len):
        # Right now all this function does is print its arguments.
        # You'll need to replace that code with code that actually finds and prints the strings!
        print(file_obj.name)
        print(encoding)
        print(min_len)

    def main():
        parser = argparse.ArgumentParser(description='Print the printable strings from a file.')
        parser.add_argument('filename')
        parser.add_argument('-n', metavar='min-len', type=int, default=4,
                            help='Print sequences of characters that are at least min-len characters long')
        parser.add_argument('-e', metavar='encoding', choices=('s', 'l', 'b'), default='s',
                            help='Select the character encoding of the strings that are to be found. ' +
                                 'Possible values for encoding are: s = UTF-8, b = big-endian UTF-16, ' +
                                 'l = little endian UTF-16.')
        args = parser.parse_args()

        with open(args.filename, 'rb') as f:
            print_strings(f, args.e, args.n)

    if __name__ == '__main__':
        main()