CMPSCI 591N : Introduction to Natural Language Processing
Fall 2007
Homework #1: Regular Expressions

 

In this homework assignment you will write one or more regular expressions, run them on data, and write a short report about the results. There are several suggested tasks below. You need to do only one task. But don't be limited by the list below! I encourage you to come up with your own task! The exact assignment is up to your own interests and creativity!

Please check the course Web site syllabus, in the homework column, and on its homepage, for any updates and clarifications to this assignment.

Example Data available

The data used in class on Tuesday is available at http://www.cs.umass.edu/~mccallum/courses/inlp2007/data. Most of these were obtained from Project Gutenberg (which you can find with a Google query). You are welcome to obtain additional data from there. The Wall Street Journal text came from ftp://ftp.cis.upenn.edu/pub/chunker.

You are also welcome to find your own data on the Web, run on your own email, or use any other large corpus you find.

Python Infrastructure available

Feel free to use the regexs.py and regexcount.py that were demonstrated in class. They are both available at http://www.cs.umass.edu/~mccallum/inlp2007/code. You are also welcome to develop your own Python programs, if you prefer and feel able.

Run them by typing, for example,
python regexs.py 'that that' *.txt
or
python regexcount.py '\bdis(\w+)\b' *.txt
at your machine's command prompt.
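
If you would rather write your own program, the following is a minimal sketch of a counting script in the same spirit as the second command above. It is not the regexcount.py shown in class, whose contents are not reproduced here; the function name count_matches, the file name count_sketch.py, and the choice to print the 20 most frequent matches are all assumptions made for illustration.

```python
import re
import sys
from collections import Counter


def count_matches(pattern, lines):
    """Count each distinct match of `pattern` in an iterable of text lines.

    Assumption for this sketch: if the pattern has a capturing group,
    the text of group 1 is counted; otherwise the whole match is counted.
    """
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        for m in regex.finditer(line):
            key = m.group(1) if regex.groups >= 1 else m.group(0)
            counts[key] += 1
    return counts


def main(pattern, filenames):
    """Total the counts over all files and print the most frequent matches."""
    totals = Counter()
    for name in filenames:
        # errors="replace" keeps the script running on oddly encoded text.
        with open(name, encoding="utf-8", errors="replace") as f:
            totals.update(count_matches(pattern, f))
    for key, n in totals.most_common(20):
        print(n, key)


# Hypothetical usage: python count_sketch.py '\bdis(\w+)\b' *.txt
if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv[1], sys.argv[2:])
```

As in the commands above, put the pattern in single quotes so that the shell passes backslashes such as \b and \w through to Python unchanged.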

Example Tasks

Or, also feel free to consider some more involved (and fun!) variants:

What to hand in, and how

The homework should be emailed to cs585-staff@cs.umass.edu.

In addition to writing your regular expression or program, write a short report about your motivations for the task you chose, your experiences in implementing it, your experiences in running it, and your findings. Feel free to suggest additional things you might like to do next that build on what you've done so far. This report should be clear and well written, but it needn't be long; about one page is fine. Also, no need for fancy formatting. In fact, you can just type this report as the body of your email. Your regular expression or program can also be included in the body, or included as an email attachment.

Grading

The assignment will be graded for (a) creativity of your selected task, (b) good use of regular expressions, (c) quality of your analysis of the results, and (d) quality and clarity of your written report.

Questions?

Feel free to ask! Send email to cs585-staff@cs.umass.edu.