Computational Linguistics

CMPSCI 591N — Spring 2006

Syllabus

Key:
JM = Jurafsky & Martin "Speech and Language Processing"
MS = Manning and Schutze "Foundations of Statistical Natural Language Processing"

Date Topics Readings Assignments

Tue Jan 31 Introduction and Overview
Welcome, motivations, what is computational linguistics, hands-on demonstrations. Ambiguity and uncertainty in language. The Turing test. Course outline and logistics. Questionnaire. Handout. Slides.

JM Ch 1
Optional:
MS Ch 1, historical overview.

 
Thu Feb 2 Regular Expressions
Chomsky hierarchy, regular languages, and their limitations. Finite-state automata. Practical regular expressions for finding and counting language phenomena. A little morphology. Slides.
JM Ch 2
Optional:
JM Ch 3 updated from book.
Install Python. HW#1 out: RegEx on corpora. Tools.
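As an illustration of the "finding and counting language phenomena" idea in the Regular Expressions session, here is a minimal sketch using Python's standard `re` module. The function name `words_with_suffix` and the sample sentence are hypothetical and not taken from the course materials.

```python
import re

def words_with_suffix(text, suffix):
    """Return every word in `text` that ends with `suffix` (case-insensitive)."""
    # \b marks a word boundary; \w+ requires at least one character before the suffix.
    pattern = re.compile(r"\b\w+" + re.escape(suffix) + r"\b", re.IGNORECASE)
    return pattern.findall(text)

# Hypothetical sample text, not a course corpus.
sample = "She was singing and walking to the ring"
ing_words = words_with_suffix(sample, "ing")
print(len(ing_words), ing_words)
```

Note that the pattern also matches "ring", which is not an -ing verb form. A purely surface-level pattern like this overgenerates, which is one reason morphology needs more than string matching.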
Tue Feb 7 Programming in Python
An introduction to programming from square one. Why Python? Variables, numbers, strings, arrays, dictionaries, conditionals, iteration. The NLTK (Natural Language Toolkit). Slides.
Refer to online programming resources, and Learning Python, at your own pace.  

Thu Feb 9 String Edit Distance
Key algorithmic tool: dynamic programming, first with a simple example, then its use in optimal alignment of sequences. String edit operations, edit distance, and examples of use in spelling correction and machine translation. Slides.
JM Ch 5.6
Optional extras: web

HW#1 due.

HW#2 out: String edit distances
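The dynamic-programming idea behind string edit distance can be sketched as follows. This is a minimal version assuming unit costs for insertion and deletion and a configurable substitution cost. The function name and the `sub_cost` parameter are illustrative choices, not the homework's required interface.

```python
def edit_distance(source, target, sub_cost=1):
    """Minimum-cost edit distance between two strings via dynamic programming.

    Insertions and deletions cost 1; substitutions cost `sub_cost`
    (an assumed parameter; some textbooks use 2 instead of 1).
    """
    m, n = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i          # delete all i characters
    for j in range(1, n + 1):
        dist[0][j] = j          # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                            # deletion
                dist[i][j - 1] + 1,                            # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost) # match or substitution
            )
    return dist[m][n]

print(edit_distance("kitten", "sitting"))
```

Each table cell depends only on its three neighbors (left, above, diagonal), so the table is filled in time proportional to the product of the two string lengths.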

Tue Feb 14 Context Free Grammars
Constituency, CFG definition, use and limitations. Chomsky Normal Form. Top-down parsing, bottom-up parsing, and the problems with each. The desirability of combining evidence from both directions. Slides.
JM Ch 9  
Thu Feb 16 Non-probabilistic Parsing
Efficient CFG parsing with CYK, another dynamic programming algorithm. Also, perhaps, the Earley parser. Designing a little grammar, and parsing with it on some test data. Slides.
JM Ch 10

HW#2 due.

HW#3 out: Designing a little grammar, and parsing with CYK.
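A rough sketch of CYK recognition for a grammar in Chomsky Normal Form follows. The toy grammar, the dictionary-based rule encoding, and the function name `cyk_recognize` are all assumptions made for illustration. This version only decides whether a sentence is in the language and does not build parse trees.

```python
from collections import defaultdict

def cyk_recognize(words, lexical, binary, start="S"):
    """Return True if `words` can be derived from `start` under a CNF grammar.

    lexical: dict mapping a word to the set of nonterminals that rewrite to it.
    binary:  dict mapping (B, C) to the set of nonterminals A with rule A -> B C.
    """
    n = len(words)
    if n == 0:
        return False
    # table[i, j] holds the nonterminals that span words[i:j]
    table = defaultdict(set)
    for i, word in enumerate(words):
        table[i, i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):              # width of the span
        for i in range(0, n - span + 1):      # left edge
            j = i + span                      # right edge
            for k in range(i + 1, j):         # split point
                for left in table[i, k]:
                    for right in table[k, j]:
                        table[i, j] |= binary.get((left, right), set())
    return start in table[0, n]

# Hypothetical toy grammar in Chomsky Normal Form.
lexical = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("Det", "N"): {"NP"}}

print(cyk_recognize("the dog chased the cat".split(), lexical, binary))
```

Because every span is computed once and reused, the three nested loops over span, start, and split point give cubic time in sentence length.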

Tue Feb 21 NO CLASS (This Tuesday follows Monday schedule.)    
Thu Feb 23 Probability
Introduction to probability theory--the backbone of modern natural language processing. Events, and counting. Joint and conditional probability, marginals, independence, Bayes' rule, combining evidence. Examples of applications in natural language. (Use a little calculus?!) Slides.
JM Ch 5.4, 5.8

HW#3 due.

HW#3b out: Extended version of HW#3.

Tue Feb 28 Information Theory
What is information? Measuring it in bits. The "noisy channel model." The "Shannon game"--motivated by language! Entropy, cross-entropy, information gain. Its application to some language phenomena. Slides.
JM Ch 6.7  
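To make "measuring information in bits" concrete, here is a small sketch of Shannon entropy, H = -sum of p * log2(p) over the outcomes. The helper names `entropy_bits` and `char_entropy` are hypothetical, and the character-level estimate simply uses relative frequencies in the given string.

```python
import math
from collections import Counter

def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def char_entropy(text):
    """Entropy in bits per character, estimated from character frequencies in `text`."""
    counts = Counter(text)
    total = sum(counts.values())
    return entropy_bits(count / total for count in counts.values())

print(entropy_bits([0.5, 0.5]))   # a fair coin carries one bit per toss
print(char_entropy("aabb"))       # two equally likely symbols
```

A distribution that puts all its mass on one outcome has zero entropy, while a uniform distribution over four outcomes has two bits.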
Thu Mar 2 Language modeling and Naive Bayes
Probabilistic language modeling and its applications. Markov models. N-grams. Estimating the probability of a word, and smoothing. Generative models of language. Their application to building an automatically-trained email spam filter, and automatically determining the language (English, French, German, Dutch, Finnish, Klingon?). Slides.
JM Ch 6.1-6.6

HW#3b due.

HW#4 out: Choice: Building a spam filter, or language id
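The spam-filter idea can be sketched as a unigram Naive Bayes classifier with add-one (Laplace) smoothing. Everything here is an illustrative assumption: the function names, the whitespace tokenization, the choice to skip words never seen in training, and the four made-up training messages. It is not the homework's specification.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Collect per-label word counts, per-label document counts, and the vocabulary."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior under add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)       # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue                                       # skip unseen words
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)                        # log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training messages for illustration only.
model = train([
    ("spam", "buy cheap pills now"),
    ("spam", "cheap pills cheap"),
    ("ham", "meeting at noon tomorrow"),
    ("ham", "lunch at noon"),
])
print(classify("cheap pills now", model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together. The same structure applies to language identification if the tokens are characters or character n-grams rather than words.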

Tue Mar 7 Part of Speech Tagging and Hidden Markov Models
The concept of parts-of-speech, examples, usage. The Penn Treebank and Brown Corpus. Probabilistic (weighted) finite state automata. Hidden Markov models (HMMs), definition and use. Slides.
Updated JM Ch on HMMs.
 
Thu Mar 9 Viterbi Algorithm for Finding Most Likely HMM Path
Dynamic programming with Hidden Markov Models, and its use for part-of-speech tagging, Chinese word segmentation, prosody, information extraction, etc.

 

HW#4 due.
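A compact sketch of the Viterbi algorithm for finding the most likely HMM state path is given below. The two-tag model (N, V), its probabilities, and the function signature are invented for illustration. The code assumes a non-empty observation sequence and uses raw probabilities rather than log probabilities, which is only safe for short inputs.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability) for a non-empty observation sequence."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s] = predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Choose the predecessor that maximizes the path probability into s.
            prev = max(states, key=lambda r: best[-1][r] * trans_p[r][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Invented toy tagging model: N = noun, V = verb.
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"fish": 0.6, "sleep": 0.4}, "V": {"fish": 0.3, "sleep": 0.7}}

tags, prob = viterbi(["fish", "sleep"], states, start_p, trans_p, emit_p)
print(tags, prob)
```

Like edit distance and CYK, this is dynamic programming: each time step keeps only the best path into each state, so the cost grows linearly with sentence length and quadratically with the number of states.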

 

Tue Mar 14 Midterm Review
Go over practice midterm, answer questions.
   
Thu Mar 16 Midterm
   
Tue Mar 21 SPRING BREAK    
Thu Mar 23 SPRING BREAK
   
Tue Mar 28 Collocations and Noun Phrase Parsing
Phrases that mean more than the sum of their parts. Using statistics to automatically discover them. Using word statistics to predict bracketing of noun phrases. Slides.
  HW#5 out: Build a part-of-speech tagger.
Thu Mar 30 Word Sense Disambiguation and Clustering
Homonymy, polysemy, different meanings, the power of context. Language neighborhood as a vector. Agglomerative clustering. Clustering by expectation maximization. Using clustering to discover different word senses. Semi-supervised document classification. Slides.
   
Tue Apr 4 Probabilistic Context Free Grammars
Weighted context free grammars. Weighted CYK. Pruning and beam search. Slides.
   
Thu Apr 6 Parsing with PCFGs
A treebank and what it takes to create one. The probabilistic version of CYK. Also: How do humans parse? Experiments with eye-tracking. Modern parsers. Slides.
JM Ch 18 HW#5 due.
Tue Apr 11 Project Proposals
Student groups give short presentations on their project idea. Feedback from the rest of class.
Selected readings. HW#6 out: Build a Weighted PCFG for a little language.
Thu Apr 13 Lexical Semantics
Guest lecture by Chris Potts, UMass Linguistics. Slides.
JM Ch 14, 15, 16  
Tue Apr 18 Machine Translation
Probabilistic models for translating French into English. Alignment, translation, language generation. IBM Model #1. Slides.
  HW#6 due. Last HW!
Thu Apr 20 Machine Translation 2
IBM Model #2, and Expectation Maximization. MT evaluation. (Continuation of previous slides.)
Selected readings.  
Tue Apr 25 Information Extraction
Building a database of person & company relations from 10 years of New York Times. Building a database of job openings from 70k company Web pages. Various methods, including HMMs. Slides.
Selected readings. Project progress report due.
Thu Apr 27 Reference Resolution
Models of anaphora resolution. Machine learning methods for coreference. Slides.
Selected readings.  
Tue May 2 Question Answering
Ask the Web: When was Mozart born? What is iron ore? Who is Bill Gates's wife? What were the causes of the Korean War? Slides.
Selected readings.  
Thu May 4 Unsupervised Language Discovery
Automatically discovering verb subcategorization. Topic models. Language modeling integrated into social network analysis. Slides.
Selected readings. Project presentation initial write-up due.
Tue May 9 Project Presentations
Student groups present the results of their project. Slides: Parsing, Semantic Entailment, Named Entity Extraction.
   
Thu May 11 Project Presentations
Student groups present the results of their project. Slides: Machine Translation, Humor Generation.
   
Tue May 16 The Future of Computational Linguistics, and Wrap-up
Broad overview, ties between computer science, statistics and linguistics. Upcoming research trends and capabilities.
   
8am Fri May 19 FINAL EXAM
ELab Rm 304.
  Project presentation final write-up due Monday May 22.