Statistical Information Extraction

CMPSCI 791S, Spring 2003
Friday 1:30-4pm, CS Rm. 203
Instructor: Andrew McCallum, CS Rm 242, 545-1323


The Web is the world's largest knowledge base.  However, its data is in a form intended for human reading, not manipulation, mining and reasoning by  computers.  Today's search engines help people find web pages.  Tomorrow's search engines will also help people find "things" (like people, jobs, companies, products), facts and their relations.
   Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human-readable text.  It is a rich and difficult problem that requires combining many sources of evidence using complex models with many parameters---all estimated from limited labeled training data.
   This course will survey many of the sub-problems and methods of information extraction, including use of finite state machines and context-free grammars, language and formatting features, generative and conditional models, rule-learning and Bayesian techniques.  We will discuss segmentation of text streams, classification of segments into fields, association of fields into records, and clustering and de-duplication of records.
   Along the way we will explore many of the mainstays of statistical modeling, including maximum likelihood, expectation maximization, estimation of multinomial and Dirichlet distributions, maximum entropy methods, discriminative training, Bayesian networks, factorial Markov models, variational approximations, mixture models, and semi-supervised training methods.
   Most of all we will have a tremendous amount of fun together learning new things in a dynamic, challenging, yet safe-for-silly-questions environment.

Target class size: 15 or fewer.


Prerequisites

CompSci 689 (Machine Learning), or
Stats 511 (Computational Multivariate Analysis), or
similar background with permission of instructor.

Grading Criteria

30% Classroom discussion
20% Research point presentations
10% Reading response papers (due Thursday noon, electronically submitted, late submission not accepted)
10% Quizzes (lowest quiz grade dropped)
30% Research project: proposal report and presentation, final report and presentation

A reading response paper is a half page or less of plain text that gives ~1-3 insightful sentences each on (1) a summary of the paper's main point, (2) something you liked, (3) a critique of some aspect, (4) something you didn't understand or a question.  Write your response in plain ASCII text, put it in a file called "response" on loki.cs, and then deposit it by running ~mccallum/public_html/courses/ie2003/bin/ response
   A research point presentation is a 10-20 minute in-class presentation on an assigned research question or point that is related to the reading.  Examples include: (1) give an introduction to the mechanics of AdaBoost, (2) compare the two different kinds of "shrinkage" in the two assigned readings, (3) give an introduction to string kernels and why they are interesting, (4) walk the class through the derivation of the "gain" in the "Inducing Features..." paper.
   Each student will do a reading response paper for every assigned paper, multiple research point presentations, and one research project.  All must be the student's own work.

Syllabus and Reading List

Papers subject to change up to 2 weeks before class.
# 1
January 31
Class Introduction and Outline.
IE overview slides
Point Presentations (Andrew McCallum)
   Naive Bayes data/likelihood/inference/estimation
   Derivation of the Maximum Likelihood Estimate, via Lagrange Multipliers
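For reference, the derivation named in the second point presentation can be sketched as follows.  This is the standard textbook derivation for multinomial parameters, with notation chosen here for illustration:

```latex
% MLE for multinomial parameters \theta_w from word counts n_w,
% maximizing the log-likelihood subject to \sum_w \theta_w = 1:
\ell(\theta) = \sum_w n_w \log \theta_w
\qquad \text{subject to} \qquad \sum_w \theta_w = 1

% Form the Lagrangian and set its derivative to zero:
\Lambda(\theta, \lambda) = \sum_w n_w \log \theta_w
  + \lambda \Big( 1 - \sum_w \theta_w \Big)
\qquad
\frac{\partial \Lambda}{\partial \theta_w}
  = \frac{n_w}{\theta_w} - \lambda = 0
\;\Rightarrow\; \theta_w = \frac{n_w}{\lambda}

% The constraint fixes \lambda = \sum_w n_w = N, giving the
% familiar relative-frequency estimate:
\hat{\theta}_w = \frac{n_w}{N}
```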
# 2
February 7
HMMs for IE & Named Entity Extraction
An Algorithm that Learns What's in a Name.  Daniel Bikel, Richard Schwartz and Ralph Weischedel, 1999.
Information Extraction with HMMs and Shrinkage. Dayne Freitag and Andrew McCallum, 1999.
Point presentations
  Named entity data and IdentiFinder error analysis:  Hema Raghavan
  Comparison of shrinkage in each model:  Jeremy Pickens
  Reading responses Top-10:  Brent Heeringa
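As background for the HMM data/likelihood/inference discussion: decoding in these models is the Viterbi algorithm.  Below is a minimal sketch; the two states, four words, and all probabilities are toy values invented for illustration, not the models from either assigned paper.

```python
# A minimal Viterbi decoder for an HMM tagger, worked in log-space.
# States, vocabulary, and probabilities are illustrative inventions.
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # V[t][s] = log-prob of the best path ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t-1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["NAME", "OTHER"]
start = {"NAME": 0.3, "OTHER": 0.7}
trans = {"NAME":  {"NAME": 0.6, "OTHER": 0.4},
         "OTHER": {"NAME": 0.2, "OTHER": 0.8}}
emit = {"NAME":  {"daniel": 0.4, "bikel": 0.4, "works": 0.1, "here": 0.1},
        "OTHER": {"daniel": 0.05, "bikel": 0.05, "works": 0.45, "here": 0.45}}

path = viterbi(["daniel", "bikel", "works", "here"],
               states, start, trans, emit)
# -> ['NAME', 'NAME', 'OTHER', 'OTHER']
```

Training the transition and emission tables from labeled data is just the multinomial MLE from week 1; shrinkage then smooths those estimates toward coarser distributions.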
# 3
February 14
Maximum Entropy Classification
A maximum entropy approach to natural language processing.  A. Berger, S. Della Pietra and V. Della Pietra, 1996.
Using Maximum Entropy for Text Classification. Kamal Nigam, John Lafferty, Andrew McCallum, 1999.
A comparison of algorithms for maximum entropy parameter estimation.  Robert Malouf, 2002.
Point presentations:
  MaxEnt data/likelihood/inference/estimation:  Andrew McCallum
  Generative vs Conditional MaxEnt:  Ramesh Nallapati
  BFGS overview and intuition:  Aron Culotta
  Review of MaxEnt uses in the HLT literature:  Fernando Diaz
  Reading responses Top-10:  David Stracuzzi
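As a warm-up for the parameter-estimation discussion: with binary features and two classes, conditional maxent is logistic regression, and the simplest (if slowest) trainer is batch gradient ascent on the conditional log-likelihood.  The features, documents, and learning rate below are toy inventions; real trainers use the iterative-scaling and quasi-Newton methods compared in the Malouf paper.

```python
# Toy conditional maximum-entropy (logistic regression) classifier
# trained by batch gradient ascent.  All data here is invented.
import math

def train_maxent(data, features, iters=500, lr=0.5):
    """data: list of (feature_set, label) pairs, labels in {0, 1}."""
    w = {f: 0.0 for f in features}     # one weight per binary feature
    for _ in range(iters):
        grad = {f: 0.0 for f in features}
        for feats, y in data:
            z = sum(w[f] for f in feats)        # score for class 1
            p1 = 1.0 / (1.0 + math.exp(-z))     # P(y = 1 | x)
            for f in feats:
                grad[f] += y - p1               # empirical - expected
        for f in features:
            w[f] += lr * grad[f]
    return w

def classify(w, feats):
    z = sum(w.get(f, 0.0) for f in feats)
    return 1 if z > 0 else 0

data = [
    ({"goal", "match"}, 1),        # label 1: sports
    ({"score", "match"}, 1),
    ({"election", "vote"}, 0),     # label 0: politics
    ({"vote", "senate"}, 0),
]
features = {"goal", "match", "score", "election", "vote", "senate"}
w = train_maxent(data, features)
# classify(w, {"match", "goal"}) -> 1;  classify(w, {"vote"}) -> 0
```

The gradient's "empirical minus expected" form is exactly the maxent constraint from the Berger et al. paper: at the optimum, each feature's expected count under the model matches its count in the training data.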
# 4
February 21
Conditional Finite State Models
Maximum Entropy Markov Models for Information Extraction and Segmentation. Andrew McCallum, Dayne Freitag and Fernando Pereira, 2000.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John Lafferty, Andrew McCallum and Fernando Pereira, 2001.
(Additional optional reading:  A Maximum Entropy Part-Of-Speech Tagger. Adwait Ratnaparkhi, 1996.)
Point Presentations:
  HMM data/likelihood/inference/estimation:  Aron Culotta
  MEMM & CRF data/likelihood/inference/estimation:  Andrew McCallum
  Presentation of Collins' paper, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. 2002:  ___
  Reading responses Top-10:  Vanessa
Project Proposals: Vanessa, Ramesh, Wei, Fernando, Jeremy
# 5
February 28
Conditional Finite State Models, Round 2
Shallow Parsing with Conditional Random Fields.  Fei Sha and Fernando Pereira, 2003.
(Additional optional reading:  Efficient Training of Conditional Random Fields.  Hanna Wallach, 2002.)
Point Presentations:
  Last week's Top-10 again: Vanessa
  CRFs: Andrew
  Top-10: Pippin
Project Proposals:  Ben & Joshua, Jerod, Hema, Peter, Andy
# 6
March 7
Feature Induction and Boosting
Inducing Features of Random Fields. Stephen Della Pietra, Vincent Della Pietra, John Lafferty, 1995.
 (Skipping section 4)
Boosting Applied to Tagging and PP Attachment. Steven Abney and Robert E. Schapire and Yoram Singer, 1999.
(Additional optional reading:
  Transformation-Based Error-Driven Learning and Natural Language Processing.  Eric Brill, 1995.)
Point Presentations:
  Overview of Boosting: David
  Introduction to Transformation-Based Learning: Ben
  Review of "Gain" in Della Pietra et al.:  Andrew
# 7
March 14
Feature Induction and Boosting, Round 2
  Toward Optimal Feature Selection.  Daphne Koller and Mehran Sahami, 1996.
(Additional optional reading:
  Feature Selection for a Rich HPSG Grammar Using Decision Trees.  Chris Manning 2002.
  Boosting and maximum likelihood for exponential models.  Guy Lebanon and John Lafferty, 2002.)
Point Presentations:
  Top-10: Joshua
  Top-10b: Peter
Project Proposals: Khash, Brent, Pippin, David, Alvaro, Aron
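A simple baseline worth keeping in mind against the Koller-Sahami and Della Pietra methods is filter-style feature selection: score each candidate feature by its mutual information with the class label and keep the top scorers.  The function below computes I(F;Y) from a 2x2 contingency table; the counts in the comments are invented toy values.

```python
# Score a binary feature by its mutual information (in bits) with a
# binary class label, from a 2x2 table of co-occurrence counts.
import math

def mutual_information(n11, n10, n01, n00):
    """n11 = feature present & class 1, n10 = present & class 0,
    n01 = absent & class 1, n00 = absent & class 0."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each term is p(f,y) * log2( p(f,y) / (p(f) p(y)) ).
    for nfy, nf, ny in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if nfy > 0:
            mi += (nfy / n) * math.log2(nfy * n / (nf * ny))
    return mi

# A perfectly predictive feature carries one full bit:
#   mutual_information(5, 0, 0, 5) -> 1.0
# An independent feature carries none:
#   mutual_information(2, 2, 2, 2) -> 0.0
```

Unlike the Della Pietra "gain," this score ignores the features already in the model, which is exactly the redundancy problem the Koller-Sahami Markov-blanket criterion is designed to fix.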
# 8
March 21
Spring Break
# 9
March 28
Finite State Structure Induction & Factorial Markov Models
Inducing Probabilistic Grammars by Bayesian Model Merging.  A. Stolcke and S. Omohundro, 1994.
Factorial hidden Markov models.  Z. Ghahramani, M. Jordan.  1995.
(Additional optional reading:
  Information Extraction with HMM Structures Learned by Stochastic Optimization. Dayne Freitag and Andrew McCallum, 2000.
  Probabilistic DFA Inference using Kullback-Leibler Divergence and Minimality.  F. Thollard, P. Dupont, C. de la Higuera
  A Coupled HMM for Audio-Visual Speech Recognition.  A. Nefian, et al. 2002.
  Audio-Visual Sound Separation Via Hidden Markov Models.  John Hershey and Michael Casey.  2001.
  Structure learning in conditional probability models via an entropic prior and parameter extinction.  Matt Brand.
  Learning Hidden Markov Model Structure for Information Extraction.  K. Seymore, et al. 1999.
  Factorial Markov Random Fields.  J. Kim and R. Zabih, 2002.)
Point presentations:
  Top-10:  __Alvaro___
  Introduction to factorial finite state machines:  __Khash__
  Overview of Hershey and Casey:  ___Jerod___
  Introduction to Bayesian Model Merging:  _Andrew__
  Overview of Seymore et al.:  ___Andy___
Project Proposal: Jen
# 10
April 4
Parsing and IE (Andrew out of town)
Three Generative, Lexicalised Models for Statistical Parsing.  Michael Collins, 1997.
A Novel Use of Statistical Parsing to Extract Information from Text.  Scott Miller et al  2000.
(Additional optional reading:
  Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. Riezler, et al, 2002.)
Point Presentations:
  Introduction to PCFG parsing & inside-outside algorithm:  ___Brent Heeringa___
  Collins paper:  __Vanessa____
  Miller paper:  ___Wei__
  Riezler paper:  __Brent?____
  Top-10: __Peter__ 
# 11
April 11
Reference-Matching, Co-reference, Identity Uncertainty and other Relations
Probabilistic Reasoning for Entity & Relation Recognition.  D. Roth and W. Yih. 2002.
Unpublished paper on relational models of IE.
(Additional optional reading:
  Representing Sentence Structure in Hidden Markov Models for Information Extraction.  Mark Craven. 2001.
  Identity Uncertainty.  Stuart Russell.  2001.
  Coreference for NLP Applications.  Thomas Morton, 2000.
  Learning to Match and Cluster Entity Names.  Cohen and Richman.  2001.
  Identity Uncertainty and Citation Matching.  Pasula et al.  2002.)

Point Presentations:
  Top-10:  ________________ 
# 12
April 18
Semi-supervised Learning for IE
Unsupervised Models for Named Entity Classification. Michael Collins and Yoram Singer, 1999.
Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping, Ellen Riloff and Rosie Jones.
Combining Labeled and Unlabeled Data with Co-Training.  A. Blum and T. Mitchell, 1998.
Learning with labeled and unlabeled data.  M. Seeger, 2001.
Text Classification from Labeled and Unlabeled Documents.  K. Nigam et al. 1999.
Information regularization with partially labeled data.  M. Szummer and T. Jaakkola. 2002.
Learning with Scope, with Application to Information Extraction and Classification.  D. Blei et al. 2001.
Latent Dirichlet Allocation
An Introduction to Variational Methods for Graphical Models.  M. Jordan et al.  1998.
Point Presentations:
  Top 10:  _________
  Introduction to Variational Methods:  __Andrew__
  Introduction to Co-training:  __Jen__
  Overview of Szummer & Jaakkola:  _________
  Overview of learning with labeled and unlabeled data (Seeger paper):  __Wei?___
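The Blum & Mitchell algorithm is short enough to sketch end to end: two classifiers, each trained on its own "view" of the data, take turns confidently labeling unlabeled examples for each other.  Everything below is a toy invention (views, seed examples, a frequency-based confidence rule), not the paper's experimental setup.

```python
# A bare-bones co-training loop in the style of Blum & Mitchell.
# Each example is a pair of view features, e.g. (spelling, context).

def train_view(labeled, view):
    """Per-view class counts with add-one smoothing."""
    counts = {}                       # feature value -> {label: count}
    for x, y in labeled:
        f = x[view]
        counts.setdefault(f, {0: 1, 1: 1})
        counts[f][y] += 1
    return counts

def predict(counts, f):
    """Return (label, confidence) for one view's feature value f."""
    c = counts.get(f, {0: 1, 1: 1})
    label = 1 if c[1] >= c[0] else 0
    return label, c[label] / (c[0] + c[1])

def cotrain(seeds, unlabeled, rounds=5, threshold=0.6):
    labeled = list(seeds)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        moved = []
        for view in (0, 1):
            counts = train_view(labeled, view)
            for x in unlabeled:
                y, conf = predict(counts, x[view])
                if conf >= threshold:        # confident: hand it over
                    labeled.append((x, y))
                    moved.append(x)
            unlabeled = [x for x in unlabeled if x not in moved]
    return labeled

# Toy task: label 1 = person, 0 = location; view 0 is a spelling cue,
# view 1 is a context word.  Two seeds bootstrap six unlabeled pairs.
result = dict(cotrain(
    [(("mr", "said"), 1), (("city", "located"), 0)],
    [("mr", "told"), ("dr", "said"), ("dr", "told"),
     ("city", "near"), ("lake", "located"), ("lake", "near")]))
# result[("dr", "told")] -> 1;  result[("lake", "near")] -> 0
```

Note how "dr" and "lake" are never seen in the seeds: view 0 labels new contexts for known spellings, and view 1 then labels new spellings from those contexts, which is the conditional-independence-of-views assumption doing the work.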
# 13
April 25
Project Presentations
# 14
May 2
Project Presentations
# 15
May 9
Project Presentations
...and wrap-up

Additional Topics:

Kernel Methods for Text
(Yes, Friday afternoon before Spring Break!)
Introduction to Large Margin Classifiers. Smola, Bartlett, Schoelkopf, Schuurmans
Maximum entropy discrimination.  T. Jaakkola, M. Meila, and T. Jebara. 1999.
String Matching Kernels for Text Classification.  H. Lodhi, C. Saunders, N. Cristianini, C. Watkins, J. Shawe-Taylor
(Additional optional reading:
  Text Categorization with Support Vector Machines.  Thorsten Joachims. 1998.
  Some SVM IE paper, Gaussian Processes)
Point presentations:
  Top-10: ________
  SVM overview: _____________
  Connections between MaxEnt and SVMs: ____________
  Explanation of string kernels:  _____________

Integration of IE with Data Mining
Ray Mooney paper
Dan Roth paper

Wrapper Induction and Multi-modal IE
Boosted Wrapper Induction.  Kushmerick and Freitag
LDA model of images and captions.  Blei and others.
Something from InfoMedia project at CMU.