CMPSCI 585 : Introduction to Natural Language Processing
Fall 2007
Homework #2: String Edit Distance

 

In this homework assignment you will modify, extend, or apply the provided Python code for calculating string edit distance, run your new program on data, and write a short report about your experiences. There are several suggested tasks below. You need to do only one task, but don't be limited by the list: you are free to come up with your own task. The exact assignment is up to your own interests and creativity.

Please check the syllabus on the course Web site, in the homework column, for any updates and clarifications to this assignment.

Python Infrastructure available

Begin with stredit.py, available at http://www.cs.umass.edu/~mccallum/courses/inlp2007/code. (This is the module that was demonstrated in class on Tuesday.) You are also welcome to develop your own Python programs from scratch, if you prefer.

You can use the code by typing, for example, the following (where $ is your command-line prompt):

$ python
>>> import stredit
>>> stredit.stredit2('tom sawyer', 'thomas sawer')
          t   o   m       s   a   w   y   e   r
      0   1   2   3   4   5   6   7   8   9  10
  t   1 * 0   1   2   3   4   5   6   7   8   9
  h   2 * 1   1   2   3   4   5   6   7   8   9
  o   3   2 * 1   2   3   4   5   6   7   8   9
  m   4   3   2 * 1   2   3   4   5   6   7   8
  a   5   4   3 * 2   2   3   3   4   5   6   7
  s   6   5   4 * 3   3   2   3   4   5   6   7
      7   6   5   4 * 3   3   3   4   5   6   7
  s   8   7   6   5   4 * 3   4   4   5   6   7
  a   9   8   7   6   5   4 * 3   4   5   6   7
  w  10   9   8   7   6   5   4 * 3 * 4   5   6
  e  11  10   9   8   7   6   5   4   4 * 4   5
  r  12  11  10   9   8   7   6   5   5   5 * 4
  4

This file provides two functions: stredit and stredit2. The second keeps track of the alignment and prints it, using a * to the left of each aligned table entry. The first doesn't keep track of the alignment, but is shorter and easier to understand. In your assignment you can start with either one and then modify it. (For debugging and for your experiments, though, you may want to see the alignment, and thus might prefer to start with the second.)
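To make the difference between the two functions concrete, here is a minimal sketch of a unit-cost edit distance together with an alignment traceback. It is not the code in stredit.py; the names edit_table, edit_distance, and alignment are hypothetical, and the sketch assumes a cost of 1 for each insertion, deletion, and substitution, which is consistent with the table printed above.

```python
def edit_table(source, target):
    """Fill the dynamic-programming table (assumed unit costs).

    Rows correspond to prefixes of target, columns to prefixes of source,
    matching the layout of the table printed in the handout.
    """
    rows, cols = len(target) + 1, len(source) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j          # delete j source items
    for i in range(rows):
        table[i][0] = i          # insert i target items
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if target[i - 1] == source[j - 1] else 1
            table[i][j] = min(table[i - 1][j - 1] + sub,  # match / substitute
                              table[i - 1][j] + 1,        # insert
                              table[i][j - 1] + 1)        # delete
    return table


def edit_distance(source, target):
    """Distance only: the bottom-right cell of the table."""
    return edit_table(source, target)[-1][-1]


def alignment(source, target):
    """Trace one lowest-cost path back through the table.

    Returns a list of (source_item, target_item) pairs, where None marks
    an insertion or a deletion.
    """
    table = edit_table(source, target)
    i, j = len(target), len(source)
    path = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if target[i - 1] == source[j - 1] else 1
            if table[i][j] == table[i - 1][j - 1] + sub:
                path.append((source[j - 1], target[i - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + 1:
            path.append((None, target[i - 1]))   # insertion
            i -= 1
        else:
            path.append((source[j - 1], None))   # deletion
            j -= 1
    path.reverse()
    return path


print(edit_distance('tom sawyer', 'thomas sawer'))  # 4, as in the table above
for pair in alignment('tom sawyer', 'thomas sawer'):
    print(pair)
```

When several paths have the same cost, this traceback prefers the diagonal step, so the alignment it reports may differ from the one stredit2 marks with asterisks.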

You can also calculate the distance between word sequences instead of character sequences. For example:

>>> stredit.stredit2(['He', 'quickly', 'ran', 'to', 'the', 'store'], ['He', 'walked', 'to', 'the', 'grocery', 'store'])
         He qui ran  to the sto
      0   1   2   3   4   5   6
 He   1 * 0 * 1   2   3   4   5
wal   2   1   1 * 2   3   4   5
 to   3   2   2   2 * 2   3   4
the   4   3   3   3   3 * 2   3
gro   5   4   4   4   4 * 3   3
sto   6   5   5   5   5   4 * 3
3
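The word-level comparison works because the dynamic program only tests items for equality, so any Python sequence can be passed in. The sketch below illustrates this with sentences tokenized by str.split(); it is a stand-alone illustration under assumed unit costs, not the stredit module itself, and edit_distance is a hypothetical helper that keeps only two rows of the table.

```python
def edit_distance(source, target):
    """Unit-cost edit distance over any two sequences (strings or lists)."""
    previous = list(range(len(source) + 1))   # row for the empty target prefix
    for i, t in enumerate(target, start=1):
        current = [i]                         # first column: i insertions
        for j, s in enumerate(source, start=1):
            sub = 0 if s == t else 1
            current.append(min(previous[j - 1] + sub,  # match / substitute
                               previous[j] + 1,        # insert
                               current[j - 1] + 1))    # delete
        previous = current
    return previous[-1]


# Whitespace tokenization is an assumption; punctuation would need more care.
first = 'He quickly ran to the store'.split()
second = 'He walked to the grocery store'.split()
print(edit_distance(first, second))  # 3, matching the word-level table above
```

Passing the untokenized strings to the same function gives the character-level distance instead, which is one way to compare the two granularities on your own data.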

Example Tasks

What to hand in, and how

The homework should be emailed to cs585-staff@cs.umass.edu.

In addition to writing your Python program, write a short report about how you selected the task you chose, your experiences in implementing it, your experiences in running it, and your findings. Feel free to suggest additional things you might like to do next that build on what you've done so far. This report should be clear and well-written, but needn't be long; half a page to one page is fine. There is also no need for fancy formatting. In fact, we prefer to receive this report as the body of your email. Your program can also be included in the body, or sent as an email attachment.

Grading

The assignment will be graded for (a) correctness of your implementation, (b) creativity of your task, implementation, use, and analysis, and (c) quality and clarity of your written report.

Questions?

Feel free to ask! Send email to cs585-staff@cs.umass.edu, or if you'd like your classmates to be able to help answer your question, use cs585-class@cs.umass.edu.