CMPSCI 585 - Intro to Natural Language Processing

Introduction to Natural Language Processing

CMPSCI 585
Spring 2004

Homework

Each homework assignment consists of a few questions with written answers. They are intended to be very short exercises that give you some practical experience and some idea of what to expect on the mid-term and final.

Homework #1.

Programming Assignments

Programming assignments consist of writing a short program (which can be done in the programming language of your choice), performing a few experiments on text data we will provide, and writing brief descriptions of your findings. We will provide detailed directions and hints that should allow you to focus on the NLP aspects of the assignment, rather than software engineering.

Below are some brief descriptions. Full details of each assignment will be made available when it is assigned.

Programming Assignment #1: Implement a CYK parser that will take as input a simple grammar, and then produce a parse of example sentences. Sample data needed for the assignment.
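
To give a feel for the chart-filling idea behind CYK, here is a minimal sketch in Python. It is not the assignment's reference solution: it assumes the grammar is already in Chomsky normal form, and the function name cyk_parse, the two rule tables (lexical and binary), and the toy grammar are all made up for illustration. The assignment's actual grammar format may differ.

```python
from collections import defaultdict


def cyk_parse(words, lexical, binary, start="S"):
    """Return one parse tree as nested tuples, or None if there is no parse.

    lexical: dict mapping word -> set of nonterminals (rules A -> word)
    binary:  dict mapping (B, C) -> set of nonterminals (rules A -> B C)
    Both tables are hypothetical formats chosen for this sketch.
    """
    n = len(words)
    # chart[(i, j)] maps a nonterminal to one tree covering words[i:j]
    chart = defaultdict(dict)

    # Width-1 spans come straight from the lexical rules.
    for i, word in enumerate(words):
        for a in lexical.get(word, ()):
            chart[(i, i + 1)][a] = (a, word)

    # Wider spans are built by combining two adjacent smaller spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for b, left in chart[(i, k)].items():
                    for c, right in chart[(k, j)].items():
                        for a in binary.get((b, c), ()):
                            # Keep the first tree found for each nonterminal.
                            chart[(i, j)].setdefault(a, (a, left, right))

    return chart[(0, n)].get(start)


# Toy grammar (invented for this example).
lexical = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

tree = cyk_parse("the dog chased the cat".split(), lexical, binary)
no_tree = cyk_parse("dog the".split(), lexical, binary)
print(tree)
print(no_tree)
```

The sketch keeps only one tree per nonterminal and span, so it reports a single parse even for ambiguous sentences. Storing all back-pointers instead would let you enumerate every parse.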

Programming Assignment #2: Implement a naive Bayes document classifier, apply it to junk email filtering (or some other document collection of your choice), and perform a few simple experiments. Additional helpful material: a paper describing the multinomial event model. Tom Mitchell's textbook also has an excellent introduction to naive Bayes document classification.
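
As a rough illustration of the multinomial event model mentioned above, the following Python sketch trains on a handful of tokenized documents and picks the class with the highest log posterior. It is only a sketch under stated assumptions: the function names train_nb and classify_nb, the add-one (Laplace) smoothing, the choice to skip words unseen in training, and the four toy messages are all illustrative, not part of the assignment.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (label, list_of_tokens) pairs. Returns the count tables."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word_counts[label][word]
    vocab = set()
    for label, tokens in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify_nb(model, tokens):
    """Return the label maximizing log P(label) + sum of log P(word | label)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing over the training vocabulary.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (invented for this example).
train = [
    ("spam", "win cash now".split()),
    ("spam", "cheap cash offer".split()),
    ("ham", "meeting at noon".split()),
    ("ham", "lunch meeting tomorrow".split()),
]
model = train_nb(train)
label_a = classify_nb(model, "cash offer now".split())
label_b = classify_nb(model, "meeting tomorrow".split())
print(label_a, label_b)
```

Working in log space avoids numerical underflow on long documents, which matters once you move from toy messages to real email.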

Programming Assignment #3: Implement a simple hidden Markov model, trained in a non-hidden fashion (that is, from data in which the tags are given), and run the Viterbi algorithm on some test part-of-speech tagging data. Try some variations, and describe your results.
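
The two pieces of this assignment, counting from tagged data and decoding with Viterbi, can be sketched as follows in Python. This is an illustrative sketch rather than the expected solution: the names train_hmm, log_prob, and viterbi, the "<s>" start symbol, the add-one smoothing, and the three toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Estimate counts from sentences given as lists of (word, tag) pairs."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"  # hypothetical start-of-sentence symbol
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for words under the model."""
    trans, emit, vocab = model
    if not words:
        return []
    tags = list(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words

    # best[t] = (log probability of the best path ending in tag t, that path)
    best = {
        t: (log_prob(trans["<s>"], t, n_tags)
            + log_prob(emit[t], words[0], n_words), [t])
        for t in tags
    }
    for word in words[1:]:
        new_best = {}
        for t in tags:
            # Choose the best previous tag for a path that now ends in t.
            score, path = max(
                ((best[p][0] + log_prob(trans[p], t, n_tags), best[p][1])
                 for p in tags),
                key=lambda item: item[0],
            )
            new_best[t] = (score + log_prob(emit[t], word, n_words), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy tagged corpus (invented for this example).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
hmm = train_hmm(corpus)
predicted = viterbi("a cat barks".split(), hmm)
print(predicted)
```

Storing the full path with each score keeps the sketch short. A back-pointer table is the more usual choice and uses less memory on long sentences.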

Final Project: Implement and explore an NLP task of your choosing. Examples might include (1) extracting from the Web the names and job titles of business people who used to go to UMass, (2) clustering text from different languages to discover a family tree of languages, (3) implementing an improved parser, (4) building a simple machine translation system, (5) clustering your email to create folders, (6) building a lexical acquisition system for a particular technical domain, (7) part-of-speech tagging, trained from labeled and unlabeled data, (8) Chinese word segmentation with HMMs. Check out papers at recent NLP conferences to get more ideas: EMNLP 2003, HLT 2003, ACL 2003.