CMPSCI 591N : Introduction to Natural Language Processing
Fall 2007
Homework #1: Regular Expressions


In this homework assignment you will write one or several regular expressions, run your expressions on data, and write a short report about the results. There are several suggested tasks below. You need to do only one task. But don't be limited by the list below! I encourage you to come up with your own task! The exact assignment is up to your own interests and creativity!

Please check the course Web site syllabus, in the homework column, and on its homepage, for any updates and clarifications to this assignment.

Example Data available

The data used in class on Tuesday is available at Most of these were obtained from Project Gutenberg (which you can find with a Google query). You are welcome to obtain additional data from there. The Wall Street Journal text came from

You are also welcome to find your own data on the Web, or to run on your own email or any other large corpus you find.

Python Infrastructure available

Feel free to use the and that were demonstrated in class. They are both available at You are also welcome to develop your own Python programs, if you prefer and feel able.

Run them by typing, for example,
python 'that that' *.txt
python '\bdis(\w+)\b' *.txt
at your machine's command prompt.
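
If you would rather write your own program, the following is a minimal sketch of a regex search in Python. It is not the code demonstrated in class; the function names find_matches and search_files and the sample sentences are made up for illustration. It uses only the standard re module, and it tries the two example patterns above on a couple of in-memory lines.

```python
import re


def find_matches(pattern, lines):
    """Yield (line_number, matched_text, groups) for each regex match in lines."""
    regex = re.compile(pattern)
    for line_number, line in enumerate(lines, start=1):
        # finditer reports every non-overlapping match on the line, not just the first
        for match in regex.finditer(line):
            yield line_number, match.group(0), match.groups()


def search_files(pattern, filenames):
    """Print every match of pattern in each named text file (hypothetical helper)."""
    for filename in filenames:
        # errors="replace" keeps the search going if a file has odd bytes
        with open(filename, encoding="utf-8", errors="replace") as handle:
            for line_number, text, groups in find_matches(pattern, handle):
                print(f"{filename}:{line_number}: {text} {groups}")


# Made-up sample lines, standing in for a real corpus
sample = [
    "I know that that is odd.",
    "They dislike disorder and distrust.",
]

# A literal pattern: finds the repeated word pair
repeated = list(find_matches(r"that that", sample))

# \b marks a word boundary; (\w+) captures what follows the "dis" prefix
dis_words = list(find_matches(r"\bdis(\w+)\b", sample))

print(repeated)
print(dis_words)
```

To search real files, call search_files with a pattern and a list of file names. When you pass a pattern on the command line, keep it in single quotes as in the examples above, so that the shell does not interpret the backslashes before Python sees them.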

Example Tasks

Or, also feel free to consider some more involved (and fun!) variants:

What to hand in, and how

The homework should be emailed to

In addition to writing your regular expression or program, write a short report about your motivations for the task you chose, your experiences in implementing it, your experiences in running it, and your findings. Feel free to suggest additional things you might like to do next that build on what you've done so far. This report should be clear and well-written, but needn't be long--about one page is fine. Also, no need for fancy formatting. In fact, you can just type this report as the body of your email. Your regular expression or program can also be included in the body, or included as an email attachment.


The assignment will be graded for (a) creativity of your selected task, (b) good use of regular expressions, (c) quality of your analysis of the results, (d) quality/clarity of your written report.


Feel free to ask! Send email to