Computational Linguistics - Introductory Handout

CMPSCI 591N : Computational Linguistics
Spring 2006
Homework #1: Regular Expressions

Out: Thursday February 2, 2006
Due: Thursday February 9, 2006, by 11:59pm, by email to compling@cs.umass.edu

In this homework assignment you will write one or several regular expressions, run your expressions on data, and write a short report about the results. There are several suggested tasks below. You only need do one task. But don't be limited by the list below! I encourage you to come up with your own task! The exact assignment is up to your own interests and creativity!

Please check the course Web site syllabus, in the homework column, and on its homepage, for any updates and clarifications to this assignment.

Example Data available

The data used in class on Tuesday is available at http://www.cs.umass.edu/~mccallum/courses/cl2006/data. Most of these were obtained from Project Gutenberg (which you can find with a Google query). You are welcome to obtain additional data from there. The Wall Street Journal text came from ftp://ftp.cis.upenn.edu/pub/chunker.

You are also welcome to find your own data on the Web, run on your own email, or use any other large corpus you find.

Python Infrastructure available

Feel free to use the regexs.py and regexcount.py that were demonstrated in class. They are both available at http://www.cs.umass.edu/~mccallum/cl2006/code. You are also welcome to develop your own Python programs, if you prefer and feel able.

Run them by typing, for example,
python regexs.py 'that that' *.txt
or
python regexcount.py '\bdis(\w+)\b' *.txt
at your machine's command prompt.
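
If you would rather write your own counting program, the following is a minimal sketch of the general idea. It is not the course's regexcount.py, whose internals are not shown here; the function name count_matches and the sample sentence are assumptions made for illustration. It tallies each distinct match, using the first parenthesized group when the pattern has one.

```python
import re
from collections import Counter


def count_matches(pattern, text):
    """Count each distinct match of `pattern` in `text`.

    If the pattern contains a capturing group, group 1 is counted;
    otherwise the whole match is counted. (Hypothetical helper, not
    the course's regexcount.py.)
    """
    regex = re.compile(pattern)
    counts = Counter()
    for m in regex.finditer(text):
        key = m.group(1) if regex.groups else m.group(0)
        counts[key] += 1
    return counts


# Made-up sample text, standing in for the contents of a *.txt file.
sample = "They dislike disorder, and they dislike disagreement."

# Print the stems that follow "dis", most frequent first.
for stem, n in count_matches(r"\bdis(\w+)\b", sample).most_common():
    print(n, stem)
```

To run something like this over real files, you could read each file's contents with open(...).read() and pass the text to count_matches.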

Example Tasks

Write a regular expression for monetary amounts (in dollars, francs, marks, lira, etc., using digits and spelled-out numbers, etc.). Run this on all the Wall Street Journal text, and report some patterns you notice. Is there a pattern to where digits versus spellings are used? Which monetary unit tends to co-occur with the largest numeric amounts?
Write a regular expression for people's names that is triggered by honorifics (Mr, Ms, Senator, etc.). Develop and debug it on newspaper text. How many names does it find? Who appears the most frequently? How many strings does it find that are errors of some sort, either not actually people's names or names with incorrect boundaries? Now run your regular expression on some different text, say, Jane Austen's books or Shakespeare's plays. How many errors does it make there?
Write several regular expressions that detect certain morphological patterns. Look for patterns of repeated consonants, or vowels. Look for repeated syllables. Make a list of morpheme suffixes, and find the words that contain the longest string of them.
Use the Wall Street Journal file that contains parts-of-speech, and look for certain patterns in parts-of-speech. What parts of speech are most likely after the phrase "that that"? What three-word part-of-speech sequences are most likely after a verb? What nouns tend to be modified by "strong" versus "powerful"?
Gender studies. Look for uses of "he" and "she," "his" and "her," and look for patterns in the words around them. Are certain verbs or adjectives more likely around men than women? What about differences between Jane Austen, Shakespeare and Walt Whitman?
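
To make the honorific task above more concrete, here is one possible starting point, offered as a rough sketch rather than a model answer. The honorific list, the names HONORIFICS, NAME_RE and find_names, and the sample text are all assumptions for illustration. The pattern treats a run of capitalized words after an honorific as a name, which is exactly the kind of rule that produces boundary errors worth reporting on.

```python
import re
from collections import Counter

# A small, hypothetical honorific list; a real one would be longer.
HONORIFICS = r"(?:Mr|Mrs|Ms|Dr|Senator)"

# Honorific, optional period, whitespace, then one or more capitalized words.
# Known weakness: a capitalized word that merely follows the name
# (for example at the start of a quotation) is swallowed into the name.
NAME_RE = re.compile(r"\b" + HONORIFICS + r"\.?\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")


def find_names(text):
    """Return a Counter mapping each captured name to its frequency."""
    return Counter(m.group(1) for m in NAME_RE.finditer(text))


# Made-up sample text for illustration.
sample = ("Mr. Darcy spoke to Mrs. Bennet. Later Senator John Smith "
          "met Mr. Darcy and Ms Jones.")

for name, n in find_names(sample).most_common():
    print(n, name)
```

Names with initials, hyphens, or particles such as "de" or "van" are missed by this pattern; counting such misses is one way to approach the error questions in the task.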

Or, also feel free to consider some more involved (and fun!) variants:

Write your own little version of ELIZA. Use a series of regular expression substitutions to create a system that engages in a typed conversation. You won't be able to do this using regexs.py and regexcount.py. You'll need to write your own Python program.
Write a program that takes English words as input, and attempts to perform a morphological parse. Rather than implementing a finite-state transducer, you might do this with a cascade of regular expressions and if-then-else statements.
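
As a hint for the ELIZA variant, the sketch below shows how a short cascade of regular expression substitutions can turn a user's sentence into a reply. The three rules, the fallback reply, and the names RULES and respond are invented for illustration; a real ELIZA has many more rules and also swaps pronouns ("my" to "your", and so on).

```python
import re

# Hypothetical rules: (compiled pattern, reply template).
# Rules are tried in order and the first one that matches wins.
RULES = [
    (re.compile(r".*\bI am (.*)", re.IGNORECASE), r"Why are you \1?"),
    (re.compile(r".*\bI feel (.*)", re.IGNORECASE), r"How long have you felt \1?"),
    (re.compile(r".*\bmy (mother|father)\b.*", re.IGNORECASE),
     r"Tell me more about your \1."),
]


def respond(utterance):
    """Return a reply to a single line of user input."""
    # Drop surrounding whitespace and trailing sentence punctuation.
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        if pattern.search(text):
            # Each pattern spans the whole matched region, so one
            # substitution rewrites the input into the reply.
            return pattern.sub(template, text, count=1)
    return "Please go on."


print(respond("I am sad."))
print(respond("My mother hates me"))
print(respond("Hello"))
```

To hold a typed conversation, you could call respond inside a loop that reads each line with input() and prints the reply.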

What to hand in, and how

The homework should be emailed to compling@cs.umass.edu before 11:59pm on Thursday February 9, 2006.

In addition to writing your regular expression or program, write a short report about your motivations for the task you chose, your experiences in implementing it, your experiences in running it, and your findings. Feel free to suggest additional things you might like to do next that build on what you've done so far. This report should be clear and well-written, but needn't be long; about one page is fine. Also, no need for fancy formatting. In fact, you can just type this report as the body of your email. Your regular expression or program can also be included in the body, or included as an email attachment.

Grading

The assignment will be graded for (a) creativity of your selected task, (b) good use of regular expressions, (c) quality of your analysis of the results, (d) quality/clarity of your written report.

Questions?

Feel free to ask! Send email to compling@cs.umass.edu.
