CMPSCI 585 : Introduction to Natural Language Processing
Fall 2004
Handout #1: Course Information


Meeting Time and Locations

Lecture: Tuesday/Thursday 2:30-3:45pm in Computer Science Building, Room 140

Electronic Communications

Course web page:

Mailing lists:
Announcements, revisions, homework hints, etc.
To subscribe send email to with SUBSCRIBE cs585 in the body.
All students who were registered as of September 7, 2004, are already subscribed to their accounts.
Send your questions here! Questions, clarifications, comments, to the course professor and TAs.
We welcome all comments and suggestions about the course!


Teaching Staff

Professor: Andrew McCallum
Office: Computer Science Building, Room 242
Office hours: TBD

Phone: (413) 545-1323
Fax: (413) 545-1789

Professor McCallum works primarily on systems that can dramatically increase our ability to mine actionable knowledge from unstructured text. He is especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature and community. To this end, his research group develops and employs various methods in statistical machine learning, natural language processing, information retrieval, and data mining, tending toward probabilistic approaches and graphical models.

Andrew McCallum is an Associate Professor at the University of Massachusetts, Amherst. He was previously Vice President of Research and Development at WhizBang Labs, a company that used machine learning for information extraction from the Web. In the late 1990s he was a Research Scientist and Coordinator at Justsystem Pittsburgh Research Center, where he spearheaded the development of technology for statistical text processing, and led the team that created the research paper search engine now available at In 1996 he was a post-doctoral fellow at Carnegie Mellon University, where he worked with Sebastian Thrun on the Intelligent Building project and with Tom M. Mitchell on the WebKB project. McCallum graduated summa cum laude from Dartmouth College in 1989, and received his PhD in computer science from the University of Rochester in 1995, where he worked with Dana Ballard.

Since 1996, McCallum has been active in research on machine learning and statistical methods applied to text, and has over 50 research publications. In 2003 he gave invited tutorials on information extraction at the NIPS and KDD conferences. He is on the editorial board of the Journal of Machine Learning Research, and has served on the program committees for many technical conferences, including IJCAI, AAAI, ICML, NIPS and UAI. He has given invited talks at MIT, Stanford, CMU, UT Austin, U. Washington, Brown, Xerox PARC, IBM Almaden, IBM Watson, SRI, AT&T Research and Google.


TA: Gary Huang
Office: Computer Science Building, Room 264
Office hours: TBD
Phone: (413) 545-3616 (during office hours only)

TA: Aron Culotta
Office: Computer Science Building, Room 264
Office hours: TBD
Phone: (413) 545-3616 (during office hours only)


Course Objective

To introduce students to both fundamental concepts in natural language processing (NLP) and some current research in the area.

Course Description

The field of natural language processing is concerned with practical and theoretical issues that arise in getting computers to perform various tasks with human languages. In this introductory course you will learn about techniques for filtering junk email, automatically discovering the different meanings of the word "run", efficiently encoding spelling rules, tagging words according to their part of speech, parsing English sentences, extracting from the Web names of companies employing UMass graduates, automatically translating from one language to another, and modeling language semantics. Our work will be a combination of learning new algorithms, discussing linguistics, and programming useful systems that operate on real data.

Whether you are interested in the intersection between the humanities and computer science, or you want a job at Google, this introductory course will help you on your way.


Intended Audience

Undergraduates in computer science, or undergraduates in linguistics and other areas who have sufficient programming and mathematical skills.
Graduate students who are not in AI and who prefer not to take the graduate-level version of this course (Natural Language Processing, 691L) are also welcome.

Course Materials

The required text is
• Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
($75.00 list price; $63.75  from

You can read the text online at
You should be able to log in with your "Umail" (OIT) username and password. According to the UMass Library, all students should have such an account because it is needed for all other UMass services (even if you primarily use a CS account).
Alternatively, if you are on campus, you can go through the UMass library page and get in without a password. Go to: Then click on "databases" and type cognet. Then click on the "cognet" site, which gives you access to books, journals, etc.

See also for supplementary information about the text, including errata, and pointers to online resources.

The following texts are useful but optional:
• James Allen. 1995. Natural Language Understanding. Benjamin/Cummings, 2nd ed.
• Gerald Gazdar and Chris Mellish. 1989. Natural Language Processing in X. Addison-Wesley.
• Dan Jurafsky and James Martin. 2000. Speech and Language Processing. Prentice Hall.

Additional handouts and papers will occasionally be distributed and discussed during the course of the class. Electronic copies (when available) can be accessed from the syllabus.


Hardware/Software Requirements

Students may use their own computers or their Edlab accounts. Materials for class assignments will be made available via the Web, and so network access will be required.


Work and Grading

5 Homework assignments (10%)

Each homework assignment consists of a few questions with written answers. They are intended to be short exercises that will give you some practical experience and some idea of what to expect on the midterm and final.

3 Programming assignments (30%)

Programming assignments consist of writing a short program (which can be done in the programming language of your choice), performing a few experiments on text data we will provide, and writing brief descriptions of your findings. We will provide detailed directions and hints that should allow you to focus on the NLP aspects of the assignment, rather than software engineering.

Midterm and Final (15% each)

Final Project (20%)

Implement and explore an NLP task of your choosing. The final project may be a group project (with at most 3 people), but the amount of work should be appropriately scaled to the size of the group, and you should include a brief statement on the responsibilities of different members of the team. Team members will normally get the same grade, but we reserve the right to differentiate in egregious cases. You will give a mini presentation on your project in the last class. You are also asked to submit an electronic copy of the final project write-up, so that we can make a class projects page.

Example project topics could include (1) extracting from the Web names and job titles of business people who used to go to UMass, (2) clustering text from different languages to discover a family tree of languages, (3) implementing an improved parser, (4) a simple machine translation system, (5) clustering your email to create folders, (6) a lexical acquisition system for a particular technical domain, (7) part-of-speech tagging, trained from labeled and unlabeled data, (8) Chinese word segmentation with HMMs. Check out papers at recent NLP conferences to get more ideas.

Class participation and possible quizzes (10%)


Policy on Regrading

We do make every effort to ensure that your assignment is graded right the first time! However, sometimes people miss things, or there can be disagreements in interpretation. If you're unhappy with the grade for a question, you need to submit a written request for a regrade and resubmit your entire homework, either to one of the TAs or to the instructor. The request doesn't have to be formal or long. Simply writing on a sheet of paper "8 points were taken off question 3, but I think it's a perfectly valid answer to the question" is sufficient. Normally, the TA will regrade it. If you're still not happy, you should repeat this process, but indicate that you want the instructor to re-regrade it. Conversely, you should not e-mail grading complaints, and you can't expect assignments to be regraded "while you wait".

Academic Honesty

All actual, detailed work on the solution of problem sets must be individual work. You are encouraged to discuss problem sets with each other in a general way, but if you do so, then you must acknowledge the people with whom you discussed the problem set at the start of your solutions. You should not look for problem answers elsewhere, but again, if material is taken from elsewhere, then you should acknowledge it. For programming projects, you are not permitted to get programming help from other people. Normally, you are permitted to use pre-existing code, but you must acknowledge code that you have taken from other sources. You will only be evaluated on code that you have written for the project, so it is not in your interest to find code that implements the core algorithms for the project. In general, we will act and expect you to act according to the UMass Academic Honesty policy, (see