CMPSCI 585 : Introduction to Natural Language Processing
Fall 2007
Handout #1: Course Information


Meeting Time and Locations

Lecture: Tuesday/Thursday 2:30-3:45pm in Computer Science Building, Room 140

Electronic Communications

Course web page:

Mailing lists:
Announcements, revisions, homework hints, etc.
It is important that all students are subscribed to this mailing list.
To subscribe send email to with SUBSCRIBE cs585 in the subject line.
Send your questions here! Questions, clarifications, and comments go to the course professor, TAs, and assistants.
We welcome all comments and suggestions about the course!


Teaching Staff       

Professor: Andrew McCallum
Office: Computer Science Building, Room 242
Office Hours: Tuesday 3:45-4:45pm (subject to change)

Phone: (413) 545-1323
Fax: (413) 545-1789

Professor McCallum's work aims to dramatically increase our ability to mine actionable knowledge from unstructured text. He is especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature & community. Toward this end his research group develops and employs various methods in statistical machine learning, natural language processing, information retrieval and data mining---tending toward probabilistic approaches and graphical models.

Andrew McCallum is an Associate Professor in the Computer Science Department at University of Massachusetts Amherst. He was previously Vice President of Research and Development at WhizBang Labs, a company that used machine learning for information extraction from the Web. In the late 1990's he was a Research Scientist and Coordinator at Justsystem Pittsburgh Research Center, where he spearheaded the creation of CORA, an early research paper search engine that used machine learning for spidering, extraction, classification and citation analysis. McCallum received his PhD from the University of Rochester in 1995, followed by a post-doctoral fellowship at Carnegie Mellon University. He is currently an action editor for the Journal of Machine Learning Research, and on the board of the International Machine Learning Society. For the past ten years, McCallum has been active in research on statistical machine learning applied to text, especially information extraction, document classification, clustering, finite state models, semi-supervised learning, and social network analysis. He has given numerous invited talks, including presentations at MIT, Stanford, Berkeley, CMU, U. Washington, Brown, Xerox PARC, IBM Almaden, IBM Watson, SRI, AT&T Research, Yahoo and Google. New work on search and bibliometric analysis of open-access research literature can be found at McCallum's web page:

TA: David Mimno (graduate student)
Office: Computer Science Building, Room 264
Office hours: TBD, see course web site.
Phone: (413) 545-3616 (during office hours only)

TA: Karl Schultz (graduate student)
Office: Computer Science Building, Room 264
Office hours: TBD, see course web site.
Phone: (413) 545-3616 (during office hours only)

Assistant: Hanna Wallach (post-doc)
Office: Computer Science Building, Room 264
Office hours: TBD, see course web site.
Phone: (413) 545-3616 (during office hours only)

Assistant: Khash Rohanemanesh (post-doc)
Office: Computer Science Building, Room 264
Office hours: TBD, see course web site.
Phone: (413) 545-3616 (during office hours only)


Course Objective

To introduce students to both the fundamental concepts of computational linguistics and natural language processing (NLP) and some current research in the area. To give students hands-on experience using computational tools to manipulate natural languages.

Course Description

Natural Language Processing addresses fundamental questions at the intersection of human languages and computer science. How can computers acquire, comprehend and produce English? How can computational methods give us insight into observed human language phenomena? How can you get a job at Google? In this interdisciplinary introductory course, you will learn how computers can do useful things with human languages, such as translate from French into English, filter junk email, extract social networks from the web, and find the main topics in the day's news. You will also learn about how computational methods can help linguists explain language phenomena, including automatic discovery of different word senses and phrase structure. Over the past decade, natural language processing has been revolutionized by statistical and probabilistic methods; you will learn about robust approaches to parameter estimation and inference. Our work will include learning new methods, discussions, and hands-on laboratories.

Whether you are interested in the intersection between the humanities and computer science, or you want a job at a Silicon Valley web company, this introductory course will help you on your way.

Intended Audience:

This course is aimed at CS and Linguistics undergraduates, and Linguistics graduate students.


Prerequisites: Either CMPSCI 287 or LINGUIST 401, or graduate standing in Linguistics. (Computer Science graduate students may enroll only with permission of the instructor.)

Expected skills:

• Basic familiarity with logic, basic mathematics (logs, exponents, etc), basic probability by ratio of counts.
• Ability to use a computer and a word processor. Readiness to learn a programming language (Python).
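
As a rough illustration of what "basic probability by ratio of counts" means, here is a minimal Python sketch. The toy sentence and the function name relative_frequency are made up for this example and are not taken from any course assignment.

```python
from collections import Counter

# Hypothetical toy text; any whitespace-separated text would do.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

counts = Counter(tokens)  # how many times each word occurs
total = len(tokens)       # total number of word tokens


def relative_frequency(word):
    """Estimate P(word) as count(word) / total number of tokens."""
    return counts[word] / total


print(relative_frequency("the"))
print(relative_frequency("sat"))
```

Here "the" occurs 4 times among 13 tokens, so its estimated probability is 4/13; a word that never occurs gets an estimate of 0.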


Course Materials

The text is Jurafsky and Martin, Speech and Language Processing.

The second edition of the text is only available online at Our course will recommend certain chapters for certain weeks, and the material will certainly reinforce the lectures, but we will not follow the book extremely tightly, and the homework assignments will not come out of the book.

See also for supplementary information about the text, including errata, and pointers to online resources.

The following texts are useful but optional:
• Chris Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press. (You can read this text online at Each student should be able to get in with their "Umail" (OIT) username and password. The UMass Library said that all students should have such an account because they need it for all other services at UMass (even if they use a CS account primarily). As an alternative, students accessing from on campus can go in through the UMass library page and get in without a password. Go to: Then click on "databases" and type cognet. Then click on the "cognet" site and it will give you access to books, journals, etc.)
• James Allen. 1995. Natural Language Understanding. Benjamin/Cummings, 2ed.
• Gerald Gazdar and Chris Mellish. 1989. Natural Language Processing in LISP. Addison-Wesley.

We will be using the programming language Python in this class. There are many excellent Python tutorials online, including some for experienced programmers, some for those new to programming, and even some for linguists who are new to programming.

• For linguists new to programming:
• Other Python pointers for linguists:
• The Natural Language Toolkit (NLTK) in Python:

Additional handouts and papers will occasionally be distributed and discussed during the course of the class. Electronic copies (when available) can be accessed from the syllabus.


Hardware/Software Requirements

Students can use their own computers. If you do not have access to a computer, see the Instructor as soon as possible, and we will make other arrangements for you. Materials for class assignments will be made available via the web, and so internet access will be required.



Grading

25% homework assignments (these will also include opportunities for extra credit)
20% final project
20% midterm exam
25% final exam
10% classroom participation & possible "collaborative exercise" quizzes
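
To see how the percentages above combine, here is a minimal Python sketch. It assumes each component is scored from 0 to 100 and ignores extra credit and late penalties. The component names and example scores are hypothetical, and this is not necessarily how final grades will actually be computed.

```python
# Weights taken from the grading breakdown above (they sum to 100%).
WEIGHTS = {
    "homework": 0.25,
    "final_project": 0.20,
    "midterm_exam": 0.20,
    "final_exam": 0.25,
    "participation": 0.10,
}


def weighted_score(scores):
    """Combine per-component scores (each 0-100) into one weighted score."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example_scores = {
    "homework": 90,
    "final_project": 85,
    "midterm_exam": 80,
    "final_exam": 88,
    "participation": 100,
}

print(weighted_score(example_scores))
```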

Homework submission: Homework is due by email attachment to by 11:59pm on the date indicated on the homework assignment. Late homework submissions may be accepted at the discretion of the instructor, but not after a solution set has been handed out. There will be grading penalties for late assignments.

Project Collaboration: One of the exciting things about this course is that we will be bringing together people with different backgrounds in computer science and natural language. We will take advantage of this by doing final projects in teams. I hope students will learn a lot from each other. As part of the write-up for the project, each student will write a brief assessment of their own contribution to the project, as well as that of their teammates. These, along with my own impressions of the contributions and teamwork, will go into the individual grades assigned.

Homework Collaboration: This fruitful collaboration shouldn't wait until the project, however. I encourage students to meet outside of class, discuss the classwork, and even work side-by-side on homework assignments. For each homework assignment, you can of course do it on your own, but you also have the option of working closely in a small group. You can discuss the assignment, the solutions, and possible extensions to the assignment that you might want to add. You will not, however, hand in a single, joint assignment. In the end, each student should write up their own assignment, write their own program, and hand in their own work. You must also write clearly at the top of the assignment who you collaborated with, and in what capacity. (See also "Academic Honesty" below.) If the line between "encouraged collaboration" and "cheating" isn't clear, please ask the instructor!

One recommended way to do the homework, especially for those new to programming (e.g. linguistics students), is to do the entire assignment during office hours, with a TA by your side. There are multiple TAs and extensive office hours especially for this purpose. Learning to program can be frustrating when done in isolation. I don't want "programming frustrations" to be a factor in this course, so you are welcome to do all of your programming in the presence of a TA, who will help you through the technical details and silly "gotchas", so you can focus on the Computational Linguistics material. Note that you can combine this recommendation with the collaboration recommendation, and show up to TA office hours with your collaborative group, and do the assignment all together there.

Rescheduling exams: Exams may be taken other than at the scheduled time, but only under exceptional circumstances and then only if approved by the instructor well before the exam. Makeup exams will rarely be the same as the original exam, and will usually be all or partly oral.

Academic Honesty: Your work must be your own, or that of your own project team. You are encouraged to discuss problems, ideas and inspirations with other students, but the final answers, the programming, the writing, and the final result that you hand in must be your own or your own project team's effort. If you have questions about what is honest, please ask! You are strongly encouraged to cite your sources if you received extraordinary help from any person or text (including the Web). Department policy specifies that the penalty for cheating or plagiarism is (1) a final course grade of "F" and (2) possible referral to the Academic Dishonesty Committee. The UMass policy can be found here.

Policy on Regrading: We do make every effort to ensure that your exam or assignment is graded right the first time! However, sometimes people miss things, or there can be disagreements in interpretation. If you're unhappy with the grade for a question, you need to make a written request for a regrade and to resubmit your entire exam or homework, either to one of the TAs or to the instructor. The request doesn't have to be formal and long. Simply writing on a sheet of paper "8 points were taken off question 3, but I think it's a perfectly valid answer to the question" is sufficient. Normally, the TA will regrade it. If you're still not happy, you should repeat this process, but indicate that you want the instructor to re-regrade it. Two things this policy does not allow: you should not e-mail grading complaints, and you can't expect assignments to be regraded "while you wait".

Auditing: If you are interested in auditing the course, please contact the instructor. Official auditors will normally be expected to complete all of the homeworks and programming assignments, and to achieve at least a C-level performance. Anyone enrolled for audit should contact the instructor early in the semester to discuss the requirements for receiving audit credit for this course.  If the course is heavily over-enrolled, auditing may not be possible.

Attendance: Students are expected to attend each class. Attendance will not be taken directly, but absences may be noted because of occasional in-class assignments. The official means of communication for this course will be in-class announcements, though every effort will be made to ensure that important announcements go out on the course mailing list or appear on the course Web pages.

Course Web page: The class World Wide Web page is Assignments, online materials, and notes about assignments will be available from this page.