This assignment is due in two weeks, at the start of class on Thursday, September 24. Mail it to
Download some of the example books from Project Gutenberg that are included with the Natural Language Toolkit: http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/gutenberg.zip.
When you unzip the file (with, e.g., unzip on Linux and Mac OS X), you should get a directory with several plain text .txt files.
This exercise is open ended. Use the regular expression Python programs regexs.py and regexcount.py from class, your favorite other language, or grep, to explore this corpus. If you use the Python code, the first argument is a regular expression and the rest are files. For instance, you could run

./regexs.py ' [A-Za-z]{20,} ' gutenberg/*.txt

to search all the Project Gutenberg texts for words over 20 characters long. Note the use of quotes to ensure that the spaces are interpreted as part of the regular expression.
Find interesting patterns, and send us the regexes that describe those patterns, some example results, and an explanation for why they're interesting. Some examples include: common morphological suffixes, patterns of verbs that introduce dialogue in novels, patterns that indicate proper names, patterns that indicate verbs.
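If you prefer to explore in plain Python, here is a minimal sketch of this kind of corpus search; it assumes the unzipped gutenberg/ directory above and uses a made-up example pattern (words ending in -ness), so the class scripts may well behave differently:

import glob
import re
from collections import Counter

# Hypothetical example pattern: words ending in the suffix "-ness".
pattern = re.compile(r'\b[A-Za-z]+ness\b')

counts = Counter()
for path in glob.glob('gutenberg/*.txt'):
    # Gutenberg texts are not always UTF-8, so fall back to Latin-1.
    with open(path, encoding='latin-1') as f:
        for match in pattern.finditer(f.read()):
            counts[match.group(0).lower()] += 1

# Print the 20 most frequent matches across the corpus.
for word, n in counts.most_common(20):
    print(n, word)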
This exercise is based on Jason Eisner and Noah Smith's Competitive Grammar Writing paper.
This section has two goals:
First, download the initial grammar: grammar1. The file contains self-explanatory comments, which are delimited by #. The basic format is:

1 VP Verb NP

Ignore the leading "1" for now. The first symbol is the left-hand side of a context-free rewrite rule; the remaining one or more symbols are right-hand-side terminals and non-terminals. There is no typographic distinction enforced between terminals and non-terminals; rather, any symbol that does not appear on a left-hand side is by definition a terminal.
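As a concrete illustration (not a required design), a grammar file in this format can be read in Python roughly like this, assuming each non-comment line is a weight, a left-hand-side symbol, and one or more right-hand-side symbols:

from collections import defaultdict

def read_grammar(path):
    """Map each left-hand-side symbol to a list of (weight, rhs_symbols) rules."""
    rules = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.split('#', 1)[0].strip()  # drop comments and blank lines
            if not line:
                continue
            fields = line.split()
            weight, lhs, rhs = float(fields[0]), fields[1], fields[2:]
            rules[lhs].append((weight, rhs))
    return rules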
Now let's get down to work: write a random sentence generator that can be invoked as

./generate grammar1 5

In other words, it should print five random sentences from the grammar grammar1 to standard output.
You should see output like:
the president ate every sandwich !
Your program should start with the ROOT symbol and recursively choose rewrite rules until termination. In grammar1, there are three possible expansions of ROOT; you should have an equal chance of choosing each of them. Don't hardwire this grammar into your program. Read it anew each time it runs, so that you can modify the grammar later.
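One possible sketch of such a generator, assuming the read_grammar helper sketched above is in the same file and treating any symbol that never appears on a left-hand side as a terminal:

import random
import sys

def generate(rules, symbol='ROOT'):
    # A symbol with no rules of its own is a terminal: emit it as a word.
    if symbol not in rules:
        return [symbol]
    # In grammar1 every weight is 1, so a uniform choice gives each rule an equal chance.
    _, rhs = random.choice(rules[symbol])
    words = []
    for s in rhs:
        words.extend(generate(rules, s))
    return words

if __name__ == '__main__':
    rules = read_grammar(sys.argv[1])   # read the grammar anew on every run
    for _ in range(int(sys.argv[2])):
        print(' '.join(generate(rules)))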
Save and hand in 10 random sentences generated by this first version of the program.
Copy grammar1 to grammar2 and modify it. Instead of just "1", allow any non-negative real number at the beginning of a rule. For instance, you can assert that the is a more common determiner than a or every like so:

4 Det the
1.5 Det a
0.5 Det every

In particular, these three rules would be chosen with probabilities (2/3, 1/4, 1/12), i.e., each weight divided by their total of 6.
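For the weighted version, one option (a sketch, not the only way) is to pass the rule weights to random.choices, which samples in proportion to them; with the Det rules above, weights 4, 1.5, and 0.5 out of a total of 6 give exactly the probabilities 2/3, 1/4, and 1/12:

import random

def weighted_generate(rules, symbol='ROOT'):
    # Same recursion as before, but each rule is chosen with
    # probability weight / total weight of that symbol's rules.
    if symbol not in rules:
        return [symbol]
    weights = [w for w, _ in rules[symbol]]
    _, rhs = random.choices(rules[symbol], weights=weights, k=1)[0]
    words = []
    for s in rhs:
        words.extend(weighted_generate(rules, s))
    return words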
Play around with the weights in grammar2 and see how your generated sentences change. Can you make the average sentence length much longer or shorter?
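One way to check empirically (a sketch that assumes your ./generate prints one sentence per line):

import subprocess

# Generate 1000 sentences with your program and average their token counts.
out = subprocess.run(['./generate', 'grammar2', '1000'],
                     capture_output=True, text=True, check=True).stdout
lengths = [len(line.split()) for line in out.splitlines() if line.strip()]
print('average sentence length:', sum(lengths) / len(lengths))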
Hand in the new probabilized generation program, your modified grammar2, and ten random sentences.
Copy grammar2 to grammar3. Modify the latter so that, in addition to the strings of grammar2, it can also generate phenomena like:

Alex kissed a sandwich .
that Alex kissed a sandwich perplexed the president .
the president thought that Alex kissed a sandwich .
the president smiled .
Alex ate a sandwich and wanted a pickle .
the president and the chief of staff understood that Alex smiled .

In addition to new terminals, like Alex, you'll need to add new non-terminals. Of course, you don't want to generate just the six sentences above, but others with the same constructions.
Hand in your expanded grammar.
Copy grammar3 to grammar4. Modify it to model some new constructions of English that you find interesting. Explain in comments to your grammar file what you are doing.