Statistical Information Extraction

CMPSCI 791S, Spring 2003
Friday 1:30-4pm, CS Rm. 203
Instructor: Andrew McCallum, CS Rm 242, 545-1323


The Web is the world's largest knowledge base.  However, its data is in a form intended for human reading, not manipulation, mining and reasoning by  computers.  Today's search engines help people find web pages.  Tomorrow's search engines will also help people find "things" (like people, jobs, companies, products), facts and their relations.
   Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human-readable text.  It is a rich and difficult problem that requires combining many sources of evidence using complex models with many parameters---all estimated from limited labeled training data.
   This course will survey many of the sub-problems and methods of information extraction, including use of finite state machines and context-free grammars, language and formatting features, generative and conditional models, rule-learning and Bayesian techniques.  We will discuss segmentation of text streams, classification of segments into fields, association of fields into records, and clustering and de-duplication of records.
   Along the way we will explore many of the mainstays of statistical modeling, including maximum likelihood, expectation maximization, estimation of multinomial and Dirichlet distributions, maximum entropy methods, discriminative training, Bayesian networks, factorial Markov models, variational approximations, mixture models, and semi-supervised training methods.
   Most of all we will have a tremendous amount of fun together learning new things in a dynamic, challenging, yet safe-for-silly-questions environment.

Target class size: 15 or fewer.


Prerequisites

CompSci 689 (Machine Learning), or
Stats 511 (Computational Multivariate Analysis), or
similar background with permission of instructor.

Grading Criteria

30% Classroom discussion
20% Research point presentations
10% Reading response papers (due Thursday noon, electronically submitted, late submission not accepted)
10% Quizzes (lowest quiz grade dropped)
30% Research project: proposal report and presentation, final report and presentation

A reading response paper is a half page or less of plain text that gives ~1-3 insightful sentences each on (1) a summary of the paper's main point, (2) something you liked, (3) a critique of some aspect, (4) something you didn't understand or a question.  Write your response in plain ASCII text, put it in a file called "response" on loki.cs, and then deposit it by running ~mccallum/public_html/courses/ie2003/bin/ response
   A research point presentation is a 10-20 minute in-class presentation on an assigned research question or point that is related to the reading.  Examples include: (1) give an introduction to the mechanics of AdaBoost, (2) compare the two different kinds of "shrinkage" in the two assigned readings, (3) give an introduction to string kernels and why they are interesting, (4) walk the class through the derivation of the "gain" in the "Inducing Features..." paper.
   Each student will do a reading response paper for every assigned paper, multiple research point presentations, and one research project.  All must be the student's own work.

Syllabus and Reading List

Papers subject to change up to 2 weeks before class.
# 1
January 31
Class Introduction and Outline.
IE overview slides
Point Presentations (Andrew McCallum)
   Naive Bayes data/likelihood/inference/estimation
   Derivation of the Maximum Likelihood Estimate, via Lagrange Multipliers
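For reference, the derivation named in the second point presentation can be sketched as follows.  This is the standard textbook derivation for multinomial parameters, with notation chosen here for illustration:

```latex
% MLE for multinomial parameters \theta_w from word counts n_w,
% maximizing the log-likelihood subject to \sum_w \theta_w = 1:
\ell(\theta) = \sum_w n_w \log \theta_w
\qquad \text{subject to} \qquad \sum_w \theta_w = 1

% Form the Lagrangian and set its derivative to zero:
\Lambda(\theta, \lambda) = \sum_w n_w \log \theta_w
  + \lambda \Big( 1 - \sum_w \theta_w \Big)
\qquad
\frac{\partial \Lambda}{\partial \theta_w}
  = \frac{n_w}{\theta_w} - \lambda = 0
\;\Rightarrow\; \theta_w = \frac{n_w}{\lambda}

% The constraint fixes \lambda = \sum_w n_w = N, giving the
% familiar relative-frequency estimate:
\hat{\theta}_w = \frac{n_w}{N}
```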
# 2
February 7
HMMs for IE & Named Entity Extraction
An Algorithm that Learns What's in a Name.  Daniel Bikel, Richard Schwartz and Ralph Weischedel, 1999.
Information Extraction with HMMs and Shrinkage. Dayne Freitag and Andrew McCallum, 1999.
Point presentations
  Named entity data and IdentiFinder error analysis:  Hema Raghavan
  Comparison of shrinkage in each model:  Jeremy Pickens
  Reading responses Top-10:  Brent Heeringa
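As background for the HMM data/likelihood/inference discussion: decoding in these models is the Viterbi algorithm.  Below is a minimal sketch; the two states, four words, and all probabilities are toy values invented for illustration, not the models from either assigned paper.

```python
# A minimal Viterbi decoder for an HMM tagger, worked in log-space.
# States, vocabulary, and probabilities are illustrative inventions.
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # V[t][s] = log-prob of the best path ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t-1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["NAME", "OTHER"]
start = {"NAME": 0.3, "OTHER": 0.7}
trans = {"NAME":  {"NAME": 0.6, "OTHER": 0.4},
         "OTHER": {"NAME": 0.2, "OTHER": 0.8}}
emit = {"NAME":  {"daniel": 0.4, "bikel": 0.4, "works": 0.1, "here": 0.1},
        "OTHER": {"daniel": 0.05, "bikel": 0.05, "works": 0.45, "here": 0.45}}

path = viterbi(["daniel", "bikel", "works", "here"],
               states, start, trans, emit)
# -> ['NAME', 'NAME', 'OTHER', 'OTHER']
```

Training the transition and emission tables from labeled data is just the multinomial MLE from week 1; shrinkage then smooths those estimates toward coarser distributions.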
# 3
February 14
Maximum Entropy Classification
A maximum entropy approach to natural language processing.  A. Berger, S. Della Pietra and V. Della Pietra, 1996.
Using Maximum Entropy for Text Classification. Kamal Nigam, John Lafferty, Andrew McCallum, 1999.
A comparison of algorithms for maximum entropy parameter estimation.  Robert Malouf, 2002.
Point presentations:
  MaxEnt data/likelihood/inference/estimation:  Andrew McCallum
  Generative vs Conditional MaxEnt:  Ramesh Nallapati
  BFGS overview and intuition:  Aron Culotta
  Review of MaxEnt uses in the HLT literature:  Fernando Diaz
  Reading responses Top-10:  David Stracuzzi
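As a warm-up for the parameter-estimation discussion: with binary features and two classes, conditional maxent is logistic regression, and the simplest (if slowest) trainer is batch gradient ascent on the conditional log-likelihood.  The features, documents, and learning rate below are toy inventions; real trainers use the iterative-scaling and quasi-Newton methods compared in the Malouf paper.

```python
# Toy conditional maximum-entropy (logistic regression) classifier
# trained by batch gradient ascent.  All data here is invented.
import math

def train_maxent(data, features, iters=500, lr=0.5):
    """data: list of (feature_set, label) pairs, labels in {0, 1}."""
    w = {f: 0.0 for f in features}     # one weight per binary feature
    for _ in range(iters):
        grad = {f: 0.0 for f in features}
        for feats, y in data:
            z = sum(w[f] for f in feats)        # score for class 1
            p1 = 1.0 / (1.0 + math.exp(-z))     # P(y = 1 | x)
            for f in feats:
                grad[f] += y - p1               # empirical - expected
        for f in features:
            w[f] += lr * grad[f]
    return w

def classify(w, feats):
    z = sum(w.get(f, 0.0) for f in feats)
    return 1 if z > 0 else 0

data = [
    ({"goal", "match"}, 1),        # label 1: sports
    ({"score", "match"}, 1),
    ({"election", "vote"}, 0),     # label 0: politics
    ({"vote", "senate"}, 0),
]
features = {"goal", "match", "score", "election", "vote", "senate"}
w = train_maxent(data, features)
# classify(w, {"match", "goal"}) -> 1;  classify(w, {"vote"}) -> 0
```

The gradient's "empirical minus expected" form is exactly the maxent constraint from the Berger et al. paper: at the optimum, each feature's expected count under the model matches its count in the training data.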
# 4
February 21
Conditional Finite State Models
Maximum Entropy Markov Models for Information Extraction and Segmentation. Andrew McCallum, Dayne Freitag and Fernando Pereira, 2000.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John Lafferty, Andrew McCallum and Fernando Pereira, 2001.
(Additional optional reading:  A Maximum Entropy Part-Of-Speech Tagger. Adwait Ratnaparkhi, 1996.)
Point Presentations:
  HMM data/likelihood/inference/estimation:  Aron Culotta
  MEMM & CRF data/likelihood/inference/estimation:  Andrew McCallum
  Presentation of Collins' paper, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. 2002:  ___
  Reading responses Top-10:  Vanessa
Project Proposals: Vanessa, Ramesh, Wei, Fernando, Jeremy
# 5
February 28
Conditional Finite State Models, Round 2
Shallow Parsing with Conditional Random Fields.  Fei Sha and Fernando Pereira, 2003.
(Additional optional reading:  Efficient Training of Conditional Random Fields.  Hanna Wallach, 2002.)
Point Presentations:
  Last week's Top-10 again: Vanessa
  CRFs: Andrew
  Top-10: Pippin
Project Proposals:  Ben & Joshua, Jerod, Hema, Peter, Andy
# 6
March 7
Feature Induction and Boosting
Inducing Features of Random Fields. Stephen Della Pietra, Vincent Della Pietra, John Lafferty, 1995.
 (Skipping section 4)
Boosting Applied to Tagging and PP Attachment. Steven Abney and Robert E. Schapire and Yoram Singer, 1999.
(Additional optional reading:
  Transformation-Based Error-Driven Learning and Natural Language Processing.  Eric Brill, 1995.)
Point Presentations:
  Overview of Boosting: David
  Introduction to Transformation-Based Learning: Ben
  Review of "Gain" in Della Pietra et al.:  Andrew
# 7
March 14
Feature Induction and Boosting, Round 2
  Toward Optimal Feature Selection.  Daphne Koller and Mehran Sahami, 1996.
(Additional optional reading:
  Feature Selection for a Rich HPSG Grammar Using Decision Trees.  Chris Manning 2002.
  Boosting and maximum likelihood for exponential models.  Guy Lebanon and John Lafferty, 2002.)
Point Presentations:
  Top-10: Joshua
  Top-10b: Peter
Project Proposals: Khash, Brent, Pippin, David, Alvaro, Aron
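A simple baseline worth keeping in mind against the Koller-Sahami and Della Pietra methods is filter-style feature selection: score each candidate feature by its mutual information with the class label and keep the top scorers.  The function below computes I(F;Y) from a 2x2 contingency table; the counts in the comments are invented toy values.

```python
# Score a binary feature by its mutual information (in bits) with a
# binary class label, from a 2x2 table of co-occurrence counts.
import math

def mutual_information(n11, n10, n01, n00):
    """n11 = feature present & class 1, n10 = present & class 0,
    n01 = absent & class 1, n00 = absent & class 0."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each term is p(f,y) * log2( p(f,y) / (p(f) p(y)) ).
    for nfy, nf, ny in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if nfy > 0:
            mi += (nfy / n) * math.log2(nfy * n / (nf * ny))
    return mi

# A perfectly predictive feature carries one full bit:
#   mutual_information(5, 0, 0, 5) -> 1.0
# An independent feature carries none:
#   mutual_information(2, 2, 2, 2) -> 0.0
```

Unlike the Della Pietra "gain," this score ignores the features already in the model, which is exactly the redundancy problem the Koller-Sahami Markov-blanket criterion is designed to fix.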
# 8
March 21
Spring Break
# 9
March 28
Finite State Structure Induction & Factorial Markov Models
Inducing Probabilistic Grammars by Bayesian Model Merging.  A. Stolcke and S. Omohundro, 1994.
Factorial hidden Markov models.  Z. Ghahramani, M. Jordan.  1995.
(Additional optional reading:
  Information Extraction with HMM Structures Learned by Stochastic Optimization. Dayne Freitag and Andrew McCallum, 2000.
  Probabilistic DFA Inference using Kullback-Leibler Divergence and Minimality.  F. Thollard, P. Dupont, C. de la Higuera
  A Coupled HMM for Audio-Visual Speech Recognition.  A. Nefian, et al. 2002.
  Audio-Visual Sound Separation Via Hidden Markov Models.  John Hershey and Michael Casey.  2001.
  Structure learning in conditional probability models via an entropic prior and parameter extinction.  Matt Brand.
  Learning Hidden Markov Model Structure for Information Extraction.  K. Seymore, et al. 1999.
  Factorial Markov Random Fields.  J. Kim and R. Zabih, 2002.)
Point presentations:
  Top-10:  __Alvaro___
  Introduction to factorial finite state machines:  __Khash__
  Overview of Hershey and Casey:  ___Jerod___
  Introduction to Bayesian Model Merging:  _Andrew__
  Overview of Seymore et al.:  ___Andy___
Project Proposal: Jen
# 10
April 4
Parsing and IE (Andrew out of town)
Three Generative, Lexicalised Models for Statistical Parsing.  Michael Collins, 1997.
A Novel Use of Statistical Parsing to Extract Information from Text.  Scott Miller et al  2000.
(Additional optional reading:
  Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. Riezler, et al, 2002.)
Point Presentations:
  Introduction to PCFG parsing & inside-outside algorithm:  ___Brent Heeringa___
  Collins paper:  __Vanessa____
  Miller paper:  ___Wei__
  Riezler paper:  __Brent?____
  Top-10: __Peter__ 
# 11
April 11
Reference-Matching, Co-reference, Identity Uncertainty and other Relations
Probabilistic Reasoning for Entity & Relation Recognition.  D. Roth and W. Yih. 2002.
Unpublished paper on relational models of IE.
(Additional optional reading:
  Representing Sentence Structure in Hidden Markov Models for Information Extraction.  Mark Craven. 2001.
  Identity Uncertainty.  Stuart Russell.  2001.
  Coreference for NLP Applications.  Thomas Morton, 2000.
  Learning to Match and Cluster Entity Names.  Cohen and Richman.  2001.
  Identity Uncertainty and Citation Matching.  Pasula et al.  2002.)

Point Presentations:
  Top-10:  ________________ 
# 12
April 18
Semi-supervised Learning for IE
Unsupervised Models for Named Entity Classification. Michael Collins and Yoram Singer, 1999.
Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping, Ellen Riloff and Rosie Jones.
Combining Labeled and Unlabeled Data with Co-Training.  A. Blum and T. Mitchell, 1998.
Learning with labeled and unlabeled data.  M. Seeger, 2001.
Text Classification from Labeled and Unlabeled Documents.  K. Nigam et al. 1999.
Information regularization with partially labeled data.  M. Szummer and T. Jaakkola. 2002.
Learning with Scope, with Application to Information Extraction and Classification.  D. Blei et al. 2001.
Latent Dirichlet Allocation
An Introduction to Variational Methods for Graphical Models.  M. Jordan et al.  1998.
Point Presentations:
  Top 10:  _________
  Introduction to Variational Methods:  __Andrew__
  Introduction to Co-training:  __Jen__
  Overview of Szummer & Jaakkola:  _________
  Overview of learning with labeled and unlabeled data (Seeger paper):  __Wei?___
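The Blum & Mitchell algorithm is short enough to sketch end to end: two classifiers, each trained on its own "view" of the data, take turns confidently labeling unlabeled examples for each other.  Everything below is a toy invention (views, seed examples, a frequency-based confidence rule), not the paper's experimental setup.

```python
# A bare-bones co-training loop in the style of Blum & Mitchell.
# Each example is a pair of view features, e.g. (spelling, context).

def train_view(labeled, view):
    """Per-view class counts with add-one smoothing."""
    counts = {}                       # feature value -> {label: count}
    for x, y in labeled:
        f = x[view]
        counts.setdefault(f, {0: 1, 1: 1})
        counts[f][y] += 1
    return counts

def predict(counts, f):
    """Return (label, confidence) for one view's feature value f."""
    c = counts.get(f, {0: 1, 1: 1})
    label = 1 if c[1] >= c[0] else 0
    return label, c[label] / (c[0] + c[1])

def cotrain(seeds, unlabeled, rounds=5, threshold=0.6):
    labeled = list(seeds)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        moved = []
        for view in (0, 1):
            counts = train_view(labeled, view)
            for x in unlabeled:
                y, conf = predict(counts, x[view])
                if conf >= threshold:        # confident: hand it over
                    labeled.append((x, y))
                    moved.append(x)
            unlabeled = [x for x in unlabeled if x not in moved]
    return labeled

# Toy task: label 1 = person, 0 = location; view 0 is a spelling cue,
# view 1 is a context word.  Two seeds bootstrap six unlabeled pairs.
result = dict(cotrain(
    [(("mr", "said"), 1), (("city", "located"), 0)],
    [("mr", "told"), ("dr", "said"), ("dr", "told"),
     ("city", "near"), ("lake", "located"), ("lake", "near")]))
# result[("dr", "told")] -> 1;  result[("lake", "near")] -> 0
```

Note how "dr" and "lake" are never seen in the seeds: view 0 labels new contexts for known spellings, and view 1 then labels new spellings from those contexts, which is the conditional-independence-of-views assumption doing the work.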
# 13
April 25
Project Presentations
# 14
May 2
Project Presentations
# 15
May 9
Project Presentations
...and wrap-up

Additional Topics:

Kernel Methods for Text
(Yes, Friday afternoon before Spring Break!)
Introduction to Large Margin Classifiers. Smola, Bartlett, Schoelkopf, Schuurmans
Maximum entropy discrimination.  T. Jaakkola, M. Meila, and T. Jebara. 1999.
String Matching Kernels for Text Classification.  H. Lodhi, C. Saunders, N. Cristianini, C. Watkins, J. Shawe-Taylor
(Additional optional reading:
  Text Categorization with Support Vector Machines.  Thorsten Joachims. 1998.
  Some SVM IE paper, Gaussian Processes)
Point presentations:
  Top-10: ________
  SVM overview: _____________
  Connections between MaxEnt and SVMs: ____________
  Explanation of string kernels:  _____________

Integration of IE with Data Mining
Ray Mooney paper
Dan Roth paper

Wrapper Induction and Multi-modal IE
Boosted Wrapper Induction.  Kushmerick and Freitag
LDA model of images and captions.  Blei and others.
Something from InfoMedia project at CMU.