## CMPSCI 591N: Introduction to Natural Language Processing, Fall 2007. Homework #1: Regular Expressions

In this homework assignment you will write one or several regular expressions, run your expressions on data, and write a short report about the results. There are several suggested tasks below. You only need to do one task. But don't be limited by the list below! I encourage you to come up with your own task! The exact assignment is up to your own interests and creativity!

Please check the syllabus on the course Web site (in the homework column) and the course homepage for any updates and clarifications to this assignment.

### Example Data available

The data used in class on Tuesday is available at http://www.cs.umass.edu/~mccallum/courses/inlp2007/data. Most of these were obtained from Project Gutenberg (which you can find with a Google query). You are welcome to obtain additional data from there. The Wall Street Journal text came from ftp://ftp.cis.upenn.edu/pub/chunker.

You are also welcome to find your own data on the Web, run your expressions on your own email, or use any other large corpus you find.

### Python Infrastructure available

Feel free to use the regexs.py and regexcount.py that were demonstrated in class. They are both available at http://www.cs.umass.edu/~mccallum/inlp2007/code. You are also welcome to develop your own Python programs, if you prefer and feel able.

Run them by typing, for example,

```
python regexs.py 'that that' *.txt
```

or

```
python regexcount.py '\bdis(\w+)\b' *.txt
```
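If you prefer to write your own program, the following is a minimal sketch of a counting script of this general kind. It is not the course's regexcount.py (whose source is not reproduced here); the function name `count_matches` and the script name in the comment are hypothetical. It assumes you want to tally the text captured by the first group when the pattern has one, and the whole match otherwise.

```python
import re
import sys
from collections import Counter


def count_matches(pattern, text):
    """Tally the distinct strings that `pattern` matches in `text`.

    Assumption: if the pattern has a capturing group, count group 1;
    otherwise count the whole match.
    """
    regex = re.compile(pattern)
    counts = Counter()
    for m in regex.finditer(text):
        key = m.group(1) if regex.groups >= 1 else m.group(0)
        counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Hypothetical usage: python count_sketch.py '\bdis(\w+)\b' *.txt
    total = Counter()
    for path in sys.argv[2:]:
        with open(path, encoding="utf-8", errors="replace") as f:
            total += count_matches(sys.argv[1], f.read())
    # Print the 20 most frequent matches, most frequent first.
    for key, n in total.most_common(20):
        print(n, key)
```

Quoting the pattern in single quotes on the command line keeps the shell from interpreting the backslashes and parentheses before Python sees them.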

• Write a regular expression for monetary amounts (in dollars, francs, marks, lira, etc., written with digits, spelled-out numbers, and so on). Run this on all the Wall Street Journal text, and report some patterns you notice. Is there a pattern to where digits versus spellings are used? Which monetary unit tends to co-occur with the largest numeric amounts?
• Write a regular expression for people's names that is triggered by honorifics (Mr, Ms, Senator, etc.). Develop and debug it on newspaper text. How many names does it find? Who appears the most frequently? How many strings does it find that are errors of some sort, either not actually people's names or with incorrect boundaries? Now run your regular expression on some different text, say, Jane Austen's books or Shakespeare's plays. How many errors does it make there?
• Write several regular expressions that detect certain morphological patterns. Look for patterns of repeated consonants or vowels. Look for repeated syllables. Make a list of morpheme suffixes, and find the words that contain the longest string of them.
• Use the Wall Street Journal file that contains parts of speech, and look for certain patterns in parts of speech. What parts of speech are most likely after the phrase "that that"? What three-word part-of-speech sequences are most likely after a verb? What nouns tend to be modified by "strong" versus "powerful"?
• Gender studies. Look for uses of "he" and "she," "his" and "her," and look for patterns in the words around them. Are certain verbs or adjectives more likely around men than women? What about differences between Jane Austen, Shakespeare and Walt Whitman?
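As one possible starting point for the honorifics task above, here is a rough sketch rather than a finished solution. The honorific list, the names `HONORIFICS`, `NAME_RE`, and `find_names`, and the choice of "one to three capitalized words" as the name shape are all illustrative assumptions; part of the assignment is discovering where choices like these go wrong.

```python
import re
from collections import Counter

# An illustrative, deliberately incomplete list of honorifics.
HONORIFICS = r"(?:Mr|Mrs|Ms|Dr|Sen|Senator|President)"

# Honorific, optional period, whitespace, then one to three capitalized
# words, which are captured as the candidate name.
NAME_RE = re.compile(
    r"\b" + HONORIFICS + r"\.?\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+){0,2})"
)


def find_names(text):
    """Count each candidate name that follows an honorific in `text`."""
    return Counter(m.group(1) for m in NAME_RE.finditer(text))


sample = "Mr. Smith met Senator Jane Doe. Later Mr. Smith left."
for name, n in find_names(sample).most_common():
    print(n, name)
```

A pattern this simple makes exactly the kinds of errors the task asks you to count. For instance, in "I saw Mr. Smith Tuesday" it captures "Smith Tuesday", an incorrect boundary, and it misses names with initials or lowercase particles entirely.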

Or, also feel free to consider some more involved (and fun!) variants:

• Write your own little version of ELIZA. Use a series of regular expression substitutions to create a system that engages in a typed conversation. You won't be able to do this using regexs.py and regexcount.py. You'll need to write your own Python program.
• Write a program that takes English words as input and attempts to perform a morphological parse. Rather than implementing a finite-state transducer, you might do this with a cascade of regular expressions and if-then-else statements.
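To illustrate the idea of a cascade of regular expression substitutions for the ELIZA variant, here is a minimal sketch. The three rules, the default reply, and the names `RULES`, `DEFAULT`, and `respond` are invented for illustration; a real ELIZA has many more rules and also swaps pronouns ("my" to "your", and so on), which this sketch does not attempt.

```python
import re

# Hypothetical rules, tried in order; the first pattern that matches wins.
# Each pattern covers the whole utterance, so the substitution replaces
# the entire input with the reply template.
RULES = [
    (re.compile(r".*\bI am (sad|depressed|unhappy)\b.*", re.I),
     r"I am sorry to hear you are \1."),
    (re.compile(r".*\bI need (.+?)[.!?]*$", re.I),
     r"Why do you need \1?"),
    (re.compile(r".*\b(mother|father|sister|brother)\b.*", re.I),
     r"Tell me more about your \1."),
]

DEFAULT = "Please go on."


def respond(utterance):
    """Return the reply from the first rule that matches, else a default."""
    for pattern, template in RULES:
        reply, n = pattern.subn(template, utterance, count=1)
        if n:
            return reply
    return DEFAULT


print(respond("I need a vacation."))
print(respond("My mother hates me"))
print(respond("Hello"))
```

Wrapping `respond` in a loop that reads lines with `input()` turns this into a typed conversation. The same first-match-wins structure, with suffix-stripping patterns in place of reply templates, is one way to approach the morphological parsing variant.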

### What to hand in, and how

The homework should be emailed to cs585-staff@cs.umass.edu.