CMPSCI 591N : Computational Linguistics
Spring 2006
Homework #2: String Edit Distance
Out: Thursday February 9, 2006
Due: Thursday February 16, 2006, by 11:59pm, by email to compling@cs.umass.edu
In this homework assignment you will modify, extend, or apply the provided Python code for calculating string edit distance, run your new program on data, and write a short report about your experiences. Several suggested tasks are listed below; you need to do only one. Don't feel limited by the list, though: you are free to come up with your own task. The exact assignment is up to your own interests and creativity.
Please check the homework column of the syllabus on the course Web site for any updates and clarifications to this assignment.
Python Infrastructure available
Begin with stredit.py, available at http://www.cs.umass.edu/~mccallum/courses/cl2006/code. (This is the module that was demonstrated in class on Tuesday.) You are also welcome to develop your own Python programs from scratch, if you prefer.
You can use the code by typing, for example (where $ is your command-line prompt):
$ python
>>> import stredit
>>> stredit.stredit2('tom sawyer', 'thomas sawer')
          t   o   m       s   a   w   y   e   r
      0   1   2   3   4   5   6   7   8   9  10
t     1 * 0   1   2   3   4   5   6   7   8   9
h     2 * 1   1   2   3   4   5   6   7   8   9
o     3   2 * 1   2   3   4   5   6   7   8   9
m     4   3   2 * 1   2   3   4   5   6   7   8
a     5   4   3 * 2   2   3   3   4   5   6   7
s     6   5   4 * 3   3   2   3   4   5   6   7
      7   6   5   4 * 3   3   3   4   5   6   7
s     8   7   6   5   4 * 3   4   4   5   6   7
a     9   8   7   6   5   4 * 3   4   5   6   7
w    10   9   8   7   6   5   4 * 3 * 4   5   6
e    11  10   9   8   7   6   5   4   4 * 4   5
r    12  11  10   9   8   7   6   5   5   5 * 4
4
This file provides two functions: stredit and stredit2. The second keeps track of the alignment and prints it, using a * to the left of each aligned table entry. The first doesn't keep track of the alignment, but is shorter and easier to understand. In your assignment you can start with and then modify either one. (For debugging and for your experiments, though, you may want to see the alignment, and thus might prefer to start with the second.)
You can also calculate the distance between word sequences instead of character sequences. For example:
>>> stredit.stredit2(['He', 'quickly', 'ran', 'to', 'the', 'store'], ['He', 'walked', 'to', 'the', 'grocery', 'store'])
         He qui ran  to the sto
      0   1   2   3   4   5   6
He    1 * 0 * 1   2   3   4   5
wal   2   1   1 * 2   3   4   5
to    3   2   2   2 * 2   3   4
the   4   3   3   3   3 * 2   3
gro   5   4   4   4   4 * 3   3
sto   6   5   5   5   5   4 * 3
3
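For readers without the file at hand, the computation behind the two tables above can be sketched roughly as follows. This is not the code in stredit.py: the function name levenshtein and its layout are invented here for illustration, and the sketch assumes unit costs for insertion, deletion, and substitution, which is what the table values suggest. It returns only the final distance and does not track or print the alignment.

```python
def levenshtein(source, target):
    """Return the unit-cost edit distance between two sequences.

    Works on strings and on lists of words alike, because it only
    indexes the inputs and compares their items for equality.
    (Hypothetical helper, not part of stredit.py.)
    """
    rows, cols = len(target) + 1, len(source) + 1
    # table[i][j] = cost of turning source[:j] into target[:i]
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j  # delete the first j items of source
    for i in range(rows):
        table[i][0] = i  # insert the first i items of target
    for i in range(1, rows):
        for j in range(1, cols):
            # d is 0 for a copy (match) and 1 for a substitution
            d = 0 if target[i - 1] == source[j - 1] else 1
            table[i][j] = min(table[i - 1][j - 1] + d,  # copy or substitute
                              table[i - 1][j] + 1,      # insert
                              table[i][j - 1] + 1)      # delete
    return table[-1][-1]


print(levenshtein('tom sawyer', 'thomas sawer'))
print(levenshtein(['He', 'quickly', 'ran', 'to', 'the', 'store'],
                  ['He', 'walked', 'to', 'the', 'grocery', 'store']))
```

The two calls print 4 and 3, matching the bottom-right entries of the tables shown above.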
Example Tasks
- (Simplest, with a moderate amount of Python programming, depending on how fancy you get with your character distance function.) Some typing errors are more likely than others because some letters are closer to each other on the keyboard than others. Modify the stredit.stredit() function to account for these differences, thus implementing Needleman-Wunsch distance instead. You could do this by replacing the if-statement that sets the variable d with different code that calls a new function (of your own creation) that takes two characters and returns an integer: smaller integers for characters close together on the keyboard, larger integers for characters further apart. Then experiment with your functionality a little. You could think of this as a component of a spelling correction system: given some string that is misspelled, and several candidate dictionary words to replace it, the string edit function returns the closest candidate. Give an example of a misspelled word and two candidates for which the distance-ordering of the candidates changes when you replace Levenshtein distance with your new Needleman-Wunsch distance.
- (Intermediate, with interesting changes to the dynamic programming in the stredit function, and cool new functionality. Not a large amount of new Python programming.) Currently the stredit.stredit() function calculates Levenshtein distance. Modify it to calculate Smith-Waterman distance instead, as described on the slides. Instead of trying to align the entirety of the two strings, Smith-Waterman finds an occurrence of string2 within string1, but allows for "soft matches" in that occurrence. For example, given the arguments "Dr. Livingston, I presume" and "Livangsten", this function would find that "Livangsten" is aligned with character offsets 4 through 13 (counting from 0) in the first string. If you want to see the alignments, you will need to modify stredit2(), not stredit(), and you will need to change the code that finds the alignment. Note that some of the costs will now be negative. Note also that the trace should not begin at the bottom-right entry in the dynamic programming table; it begins at whichever entry has the minimum value. Experiment with your modifications. What happens as you change the relative magnitude of the "positive" scores for matching (copying) versus the "negative" scores for substitution, insertion and deletion?
- (Very complex! A real challenge requiring deeper understanding of the dynamic programming concept, and also perhaps the most interesting.) Modify the current Levenshtein distance implementation to implement instead an Affine Gap distance calculation. In Affine Gap, each repeated deletion or insertion in a run is cheaper than the first deletion or insertion. Implementing this requires modeling a little finite state machine that keeps track of the last edit action. You will also need to change the shape of the dynamic programming table to account for this extra state. Experiment with your new function. Show some examples of strings where the distance values change or the string comparison rankings among several alternative strings change.
- (Not much change to the string edit function, but a bit more Python programming required.) Implement a simple machine translation alignment system. As demonstrated in class, the stredit() function can take sequences of words in addition to sequences of characters. Create (or find on the Web) a little translation dictionary (for example: english2french = {'the': ['le', 'la'], 'dog': ['chien']}, etc.; a Python dictionary cannot hold the same key twice, so map each word to a list of its translations), and use this in a new distance function that will create good alignments between a few example sentence pairs in different languages.
- (No change to the string edit function, and therefore less desirable for learning about it; a fair amount of Python programming required.) Implement a spell checker based on string edit distances. Because it doesn't involve much understanding of how string edit distance works, I don't recommend this as the only HW task you do. If you are interested in this, you could do it as a second task. (Note that, in general, only one task is required.)
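As a rough illustration of the first task's idea (replacing the fixed substitution cost with a call to a character distance function), here is a minimal sketch. Everything in it is an assumption made for illustration rather than part of stredit.py: the names QWERTY_ROWS, keyboard_cost and weighted_distance, the simplified QWERTY grid that ignores the stagger between rows, and the particular cost values (1 for neighboring keys, 2 for all other substitutions, 1 for an insertion or deletion).

```python
# Simplified QWERTY layout (assumption: row stagger is ignored).
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
KEY_POS = {ch: (r, c)
           for r, row in enumerate(QWERTY_ROWS)
           for c, ch in enumerate(row)}


def keyboard_cost(a, b):
    """Hypothetical substitution cost: 0 for a copy, 1 for neighboring
    keys, 2 for anything else (including characters not in the grid).
    A case-only difference such as 'A' vs 'a' costs 1."""
    if a == b:
        return 0
    pa, pb = KEY_POS.get(a.lower()), KEY_POS.get(b.lower())
    if pa is None or pb is None:
        return 2
    near = abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1
    return 1 if near else 2


def weighted_distance(source, target, sub_cost=keyboard_cost, indel_cost=1):
    """Edit distance with a pluggable substitution cost function.
    Keeps only the previous table row, so no alignment is recovered."""
    prev = [j * indel_cost for j in range(len(source) + 1)]
    for i, t in enumerate(target, start=1):
        curr = [i * indel_cost]
        for j, s in enumerate(source, start=1):
            curr.append(min(prev[j - 1] + sub_cost(s, t),  # copy/substitute
                            prev[j] + indel_cost,          # insert t
                            curr[j - 1] + indel_cost))     # delete s
        prev = curr
    return prev[-1]


def unit_cost(a, b):
    """Plain Levenshtein substitution cost, for comparison."""
    return 0 if a == b else 1


for candidate in ("cat", "cut"):
    print(candidate,
          weighted_distance("cst", candidate, sub_cost=unit_cost),
          weighted_distance("cst", candidate))
```

With unit costs both candidates are at distance 1 from "cst", while the keyboard-aware cost gives 1 for "cat" (s and a are neighbors) and 2 for "cut". This only breaks a tie; finding a case where the ordering actually reverses, and choosing cost values that make sense, is left to the task itself.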
What to hand in, and how
The homework should be emailed to compling@cs.umass.edu before 11:59pm on Thursday February 16, 2006.
In addition to writing your Python program, write a short report about how you selected the task you chose, your experiences in implementing it, your experiences in running it, and your findings. Feel free to suggest additional things you might like to do next that build on what you've done so far. This report should be clear and well-written, but needn't be long: half a page to one page is fine. Also, no need for fancy formatting. In fact, we prefer to receive this report as the body of your email. Your program can also be included in the body, or included as an email attachment.
Grading
The assignment will be graded for (a) correctness of your implementation, (b) creativity of your task, implementation, use and analysis, and (c) quality and clarity of your written report.
Questions?
Feel free to ask! Send email to compling@cs.umass.edu, or if you'd like your classmates to be able to help answer your question, use compling-class@cs.umass.edu.