This assignment is due in two weeks, at the start of class on Thursday, September 24. Mail it to
Download some of the example books from Project Gutenberg that are included with the Natural Language Toolkit: http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/gutenberg.zip.
When you unzip the file (with, e.g., unzip on Linux and Mac OS X), you should get a directory with several plain text .txt files.
This exercise is open ended. Use the regular expression Python programs regexs.py and regexcount.py from class, your favorite other language, or grep, to explore this corpus. If you use the Python code, the first argument is a regular expression and the rest are files. For instance, you could run

./regexs.py ' [A-Za-z]{20,} ' gutenberg/*.txt

to search all the Project Gutenberg texts for words over 20 characters long. Note the use of quotes to ensure that the spaces are interpreted as part of the regular expression.
Find interesting patterns, and send us the regexes that describe those patterns, some example results, and an explanation for why they're interesting. Some examples include: common morphological suffixes, patterns of verbs that introduce dialogue in novels, patterns that indicate proper names, patterns that indicate verbs.
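If you prefer to explore in plain Python, here is a minimal sketch of this kind of corpus search; it assumes the unzipped gutenberg/ directory above and uses a made-up example pattern (words ending in -ness), so the class scripts may well behave differently:

import glob
import re
from collections import Counter

# Hypothetical example pattern: words ending in the suffix "-ness".
pattern = re.compile(r'\b[A-Za-z]+ness\b')

counts = Counter()
for path in glob.glob('gutenberg/*.txt'):
    # Gutenberg texts are not always UTF-8, so fall back to Latin-1.
    with open(path, encoding='latin-1') as f:
        for match in pattern.finditer(f.read()):
            counts[match.group(0).lower()] += 1

# Print the 20 most frequent matches across the corpus.
for word, n in counts.most_common(20):
    print(n, word)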
This exercise is based on Jason Eisner and Noah Smith's Competitive Grammar Writing paper.
This section has two goals:
First, download the initial grammar: grammar1. The file contains self-explanatory comments, which are delimited by #. The basic format is:

1 VP Verb NP

Ignore the leading "1" for now. The first symbol is the left-hand side of a context-free rewrite rule; the remaining one or more symbols are right-hand-side terminals and non-terminals. There is no typographic distinction enforced between terminals and non-terminals; rather, any symbol that does not appear on a left-hand side is by definition a terminal.
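As a concrete illustration (not a required design), a grammar file in this format can be read in Python roughly like this, assuming each non-comment line is a weight, a left-hand-side symbol, and one or more right-hand-side symbols:

from collections import defaultdict

def read_grammar(path):
    """Map each left-hand-side symbol to a list of (weight, rhs_symbols) rules."""
    rules = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.split('#', 1)[0].strip()  # drop comments and blank lines
            if not line:
                continue
            fields = line.split()
            weight, lhs, rhs = float(fields[0]), fields[1], fields[2:]
            rules[lhs].append((weight, rhs))
    return rules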
Now let's get down to work: write a random sentence generator that can be invoked as

./generate grammar1 5

In other words, it should print five random sentences from the grammar grammar1 to standard output.
You should see output like:
the president ate every sandwich !
Your program should start with the ROOT symbol and recursively choose rewrite rules until termination. In grammar1, there are three possible expansions of ROOT; you should have an equal chance of choosing each of them. Don't hardwire this grammar into your program. Read it anew each time it runs, so that you can modify the grammar later.
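One possible sketch of such a generator, assuming the read_grammar helper sketched above is in the same file and treating any symbol that never appears on a left-hand side as a terminal:

import random
import sys

def generate(rules, symbol='ROOT'):
    # A symbol with no rules of its own is a terminal: emit it as a word.
    if symbol not in rules:
        return [symbol]
    # In grammar1 every weight is 1, so a uniform choice gives each rule an equal chance.
    _, rhs = random.choice(rules[symbol])
    words = []
    for s in rhs:
        words.extend(generate(rules, s))
    return words

if __name__ == '__main__':
    rules = read_grammar(sys.argv[1])   # read the grammar anew on every run
    for _ in range(int(sys.argv[2])):
        print(' '.join(generate(rules)))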
Save and hand in 10 random sentences generated by this first version of the program.
Copy grammar1 to grammar2 and modify it. Instead of just "1", allow any non-negative real number at the beginning of a rule. For instance, you can assert that the is a more common determiner than a or every like so:

4 Det the
1.5 Det a
0.5 Det every

In particular, these three rules would be chosen with probabilities (2/3, 1/4, 1/12), i.e., each weight divided by their total of 6.
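For the weighted version, one option (a sketch, not the only way) is to pass the rule weights to random.choices, which samples in proportion to them; with the Det rules above, weights 4, 1.5, and 0.5 out of a total of 6 give exactly the probabilities 2/3, 1/4, and 1/12:

import random

def weighted_generate(rules, symbol='ROOT'):
    # Same recursion as before, but each rule is chosen with
    # probability weight / total weight of that symbol's rules.
    if symbol not in rules:
        return [symbol]
    weights = [w for w, _ in rules[symbol]]
    _, rhs = random.choices(rules[symbol], weights=weights, k=1)[0]
    words = []
    for s in rhs:
        words.extend(weighted_generate(rules, s))
    return words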
Play around with the weights in grammar2 and see how your generated sentences change. Can you make the average sentence length much longer or shorter?
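One way to check empirically (a sketch that assumes your ./generate prints one sentence per line):

import subprocess

# Generate 1000 sentences with your program and average their token counts.
out = subprocess.run(['./generate', 'grammar2', '1000'],
                     capture_output=True, text=True, check=True).stdout
lengths = [len(line.split()) for line in out.splitlines() if line.strip()]
print('average sentence length:', sum(lengths) / len(lengths))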
Hand in the new probabilized generation program, your modified grammar2, and ten random sentences.
Copy grammar2 to grammar3. Modify the latter so that, in addition to the strings of grammar2, it can also generate phenomena like:

Alex kissed a sandwich .
that Alex kissed a sandwich perplexed the president .
the president thought that Alex kissed a sandwich .
the president smiled .
Alex ate a sandwich and wanted a pickle .
the president and the chief of staff understood that Alex smiled .

In addition to new terminals, like Alex, you'll need to add new non-terminals. Of course, you don't want to generate just the six sentences above, but others with the same constructions.
Hand in your expanded grammar.
Copy grammar3 to grammar4. Modify it to model some new constructions of English that you find interesting. Explain in comments to your grammar file what you are doing.