Friday 1:30-4pm, CS Rm. 203

Instructor: Andrew McCallum, CS Rm 242, 545-1323

The Web is the world's largest knowledge base. However, its data is in a form intended for human reading, not manipulation, mining and reasoning by computers. Today's search engines help people find web pages. Tomorrow's search engines will also help people find "things" (like people, jobs, companies, products), facts and their relations.

Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human-readable text. It is a rich and difficult problem: many sources of evidence must be combined using complex models with many parameters---all estimated from limited labeled training data.
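As a toy illustration of the task (not part of the course materials): a hand-written extractor might fill speaker, time, and room fields of a seminar-announcement record with regular expressions. The field names and patterns below are invented; the statistical methods surveyed in this course replace such brittle hand-built patterns with learned models.

```python
import re

# A toy seminar announcement; the field names and patterns are illustrative.
text = "Talk by Prof. Jane Smith, Friday 1:30-4pm, CS Rm. 203"

patterns = {
    "speaker": r"Prof\.\s+([A-Z]\w+\s+[A-Z]\w+)",
    "time":    r"(\d{1,2}:\d{2}-\d{1,2}(?::\d{2})?(?:am|pm))",
    "room":    r"(Rm\.?\s*\d+)",
}
record = {}
for field, pat in patterns.items():
    m = re.search(pat, text)
    record[field] = m.group(1) if m else None

print(record)  # {'speaker': 'Jane Smith', 'time': '1:30-4pm', 'room': 'Rm. 203'}
```

The fragility is the point: the patterns break as soon as the announcement says "Dr." instead of "Prof.", which is why the course turns to models estimated from data.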

This course will survey many of the sub-problems and methods of information extraction, including the use of finite state machines and context-free grammars, language and formatting features, generative and conditional models, and rule-learning and Bayesian techniques. We will discuss segmentation of text streams, classification of segments into fields, association of fields into records, and clustering and de-duplication of records.

Along the way we will explore many of the mainstays of statistical modeling, including maximum likelihood, expectation maximization, estimation of multinomial and Dirichlet distributions, maximum entropy methods, discriminative training, Bayesian networks, factorial Markov models, variational approximations, mixture models, and semi-supervised training methods.

Most of all, we will have a tremendous amount of fun together learning new things in a dynamic, challenging, yet safe-for-silly-questions environment. Target class size: 15 or fewer.

Prerequisites: CompSci 689 (Machine Learning), Stats 511 (Computational Multivariate Analysis), or similar background with permission of the instructor.

Grading:

30% Classroom discussion

20% Research point presentations

10% Reading response papers (due Thursday noon, electronically submitted, late submission not accepted)

10% Quizzes (lowest quiz grade dropped)

30% Research project: proposal report and presentation, final report and presentation

A reading response paper is a half page or less of plain text that gives ~1-3 insightful sentences each on (1) a summary of the paper's main point, (2) something you liked, (3) a critique of some aspect, and (4) something you didn't understand or a question. Write your response in plain ASCII text, put it in a file called "response" on loki.cs, and then deposit it by running ~mccallum/public_html/courses/ie2003/bin/submit.pl response

A research point presentation is a 10-20 minute in-class presentation on an assigned research question or point related to the reading. Examples include: (1) give an introduction to the mechanics of AdaBoost, (2) compare the two different kinds of "shrinkage" in the two assigned readings, (3) give an introduction to string kernels and why they are interesting, (4) walk the class through the derivation of the "gain" in the "Inducing Features..." paper.

Each student will do a reading response paper for every assigned paper, multiple research point presentations, and one research project. All must be the student's own work.

# 1
January 31
Class Introduction and Outline.
Self-introductions. IE overview slides. Point Presentations (Andrew McCallum):
Naive Bayes data/likelihood/inference/estimation
Derivation of the Maximum Likelihood Estimate, via Lagrange Multipliers
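For reference, the Lagrange-multiplier derivation listed above can be sketched as follows, for the maximum likelihood estimate of multinomial parameters (θ_w is the probability of outcome w, N_w its observed count):

```latex
% Maximize the log-likelihood subject to the normalization constraint:
\ell(\theta) = \sum_w N_w \log \theta_w
  \quad \text{s.t.} \quad \sum_w \theta_w = 1 .
% Introduce a Lagrange multiplier \lambda:
\Lambda(\theta, \lambda) = \sum_w N_w \log \theta_w
  + \lambda \Bigl( 1 - \sum_w \theta_w \Bigr) .
% Setting \partial\Lambda/\partial\theta_w = N_w/\theta_w - \lambda = 0
% gives \theta_w = N_w/\lambda; the constraint fixes \lambda = \sum_{w'} N_{w'}:
\hat\theta_w = \frac{N_w}{\sum_{w'} N_{w'}} .
```

That is, the maximum likelihood estimate is simply the normalized counts.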

# 2
February 7
HMMs for IE & Named Entity Extraction
An Algorithm that Learns What's in a Name. Daniel Bikel, Richard Schwartz and Ralph Weischedel, 1999.
Information Extraction with HMMs and Shrinkage. Dayne Freitag and Andrew McCallum, 1999.
Point presentations:
Named entity data and Identifinder error analysis: Hema Ragavan
Comparison of shrinkage in each model: Jeremy Pickens
Reading responses Top-10: Brent Heeringa
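A minimal sketch of the Viterbi decoding at the heart of HMM-based named-entity extraction. The two states, the vocabulary, and all probabilities below are invented for illustration; real systems like Identifinder use far richer state structure and smoothed emission models.

```python
import math

# Toy HMM tagging tokens as inside a NAME or as OTHER text.
states = ["NAME", "OTHER"]
start = {"NAME": 0.2, "OTHER": 0.8}
trans = {"NAME": {"NAME": 0.5, "OTHER": 0.5},
         "OTHER": {"NAME": 0.2, "OTHER": 0.8}}
emit = {"NAME": {"daniel": 0.4, "bikel": 0.4, "wrote": 0.1, "it": 0.1},
        "OTHER": {"daniel": 0.05, "bikel": 0.05, "wrote": 0.5, "it": 0.4}}

def viterbi(tokens):
    """Return the most probable state sequence via dynamic programming."""
    V = [{s: (math.log(start[s]) + math.log(emit[s][tokens[0]]), [s])
          for s in states}]
    for tok in tokens[1:]:
        V.append({})
        for s in states:
            score, path = max(
                (V[-2][prev][0] + math.log(trans[prev][s]), V[-2][prev][1])
                for prev in states)
            V[-1][s] = (score + math.log(emit[s][tok]), path + [s])
    return max(V[-1].values())[1]

print(viterbi(["daniel", "bikel", "wrote", "it"]))
# ['NAME', 'NAME', 'OTHER', 'OTHER']
```

Working in log space, as here, avoids the numerical underflow that multiplying many small probabilities would cause on realistic sentence lengths.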

# 3
February 14
Maximum Entropy Classification
A maximum entropy approach to natural language processing. A. Berger, S. Della Pietra and V. Della Pietra, 1996.
Using Maximum Entropy for Text Classification. Kamal Nigam, John Lafferty, Andrew McCallum, 1999.
A comparison of algorithms for maximum entropy parameter estimation. Robert Malouf, 2002.
Point presentations:
MaxEnt data/likelihood/inference/estimation: Andrew McCallum
Generative vs Conditional MaxEnt: Ramesh Nallapati
BFGS overview and intuition: Aron Culotta
Review of MaxEnt uses in the HLT literature: Fernando Diaz
Reading responses Top-10: David Stracuzzi
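A minimal sketch of conditional maximum entropy classification (equivalently, multinomial logistic regression) trained by plain gradient ascent on the log-likelihood: the gradient for each (feature, label) weight is the empirical feature count minus the model's expected count. The features, labels, and data below are invented; serious implementations use iterative scaling, conjugate gradient, or (L-)BFGS, as compared in the Malouf paper.

```python
import math

# Toy training set of (feature dict, label) pairs; everything is invented.
data = [({"contains_digit": 1.0}, "TIME"),
        ({"capitalized": 1.0}, "NAME"),
        ({"capitalized": 1.0, "contains_digit": 1.0}, "TIME"),
        ({"capitalized": 1.0}, "NAME")]
labels = ["TIME", "NAME"]
w = {}  # weights indexed by (feature, label)

def p(x, y):
    """Conditional probability p(y|x) under the log-linear model."""
    scores = {l: sum(w.get((f, l), 0.0) * v for f, v in x.items())
              for l in labels}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[y]) / z

for _ in range(200):          # plain gradient ascent on the log-likelihood
    grad = {}
    for x, y in data:
        for f, v in x.items():
            grad[(f, y)] = grad.get((f, y), 0.0) + v        # empirical count
            for l in labels:                                 # expected count
                grad[(f, l)] = grad.get((f, l), 0.0) - v * p(x, l)
    for k, g in grad.items():
        w[k] = w.get(k, 0.0) + 0.5 * g

print(p({"contains_digit": 1.0}, "TIME"))  # should be close to 1
```

Because the log-likelihood is concave in the weights, this simple ascent converges to the global maximum entropy solution on this toy problem.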

# 4
February 21
Conditional Finite State Models
Maximum Entropy Markov Models for Information Extraction and Segmentation. Andrew McCallum, Dayne Freitag and Fernando Pereira, 2000.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John Lafferty, Andrew McCallum and Fernando Pereira, 2001.
(Additional optional reading: A Maximum Entropy Part-Of-Speech Tagger. Adwait Ratnaparkhi, 1996.)
Point Presentations:
HMM data/likelihood/inference/estimation: Aron Culotta
MEMM & CRF data/likelihood/inference/estimation: Andrew McCallum
Presentation of Collins' paper, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, 2002: ___
Reading responses Top-10: Vanessa
Project Proposals: Vanessa, Ramesh, Wei, Fernando, Jeremy

# 5
February 28
Conditional Finite State Models, Round 2
Shallow Parsing with Conditional Random Fields. Fei Sha and Fernando Pereira, 2003.
(Additional optional reading: Efficient Training of Conditional Random Fields. Hanna Wallach, 2002.)
Point Presentations:
Last week's Top-10 again: Vanessa
CRFs: Andrew
Top-10: Pippin
Project Proposals: Ben & Joshua, Jerod, Hema, Peter, Andy

# 6
March 7
Feature Induction and Boosting
Inducing Features of Random Fields. Stephen Della Pietra, Vincent Della Pietra, John Lafferty, 1995. (Skipping section 4.)
Boosting Applied to Tagging and PP Attachment. Steven Abney, Robert E. Schapire and Yoram Singer, 1999.
(Additional optional reading: Transformation-Based Error-Driven Learning and Natural Language Processing. Eric Brill, 1995.)
Point Presentations:
Overview of Boosting: David
Introduction to Transformation-Based Learning: Ben
Review of "Gain" in Della Pietra et al.: Andrew
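A minimal sketch of the mechanics of AdaBoost, matching the boosting-overview point presentation: weak learners (here, invented one-dimensional threshold stumps on invented data) are chosen greedily by weighted error, and example weights are multiplicatively increased on mistakes.

```python
import math

# Toy 1-D dataset of (x, label in {-1, +1}); all values are invented.
data = [(0.1, 1), (0.2, 1), (0.4, -1), (0.6, 1), (0.8, -1), (0.9, -1)]

def stump(threshold, sign):
    """Weak learner: predict `sign` left of the threshold, `-sign` right."""
    return lambda x: sign if x < threshold else -sign

candidates = [stump(t, s) for t in (0.3, 0.5, 0.7) for s in (1, -1)]

weights = [1.0 / len(data)] * len(data)
ensemble = []                      # list of (alpha, weak hypothesis)
for _ in range(3):                 # three rounds of boosting
    # Pick the weak hypothesis with the lowest weighted error.
    h = min(candidates, key=lambda c: sum(
        w for w, (x, y) in zip(weights, data) if c(x) != y))
    err = sum(w for w, (x, y) in zip(weights, data) if h(x) != y)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
    ensemble.append((alpha, h))
    # Re-weight: mistakes up, correct examples down, then normalize.
    weights = [w * math.exp(-alpha * y * h(x))
               for w, (x, y) in zip(weights, data)]
    z = sum(weights)
    weights = [w / z for w in weights]

def predict(x):
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

print([predict(x) for x, _ in data])  # matches all six labels
```

No single stump can separate this data, but the weighted vote of three rounds classifies every example correctly, which is the essential point of boosting.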

# 7
March 14
Feature Induction and Boosting, Round 2
Toward Optimal Feature Selection. Daphne Koller and Mehran Sahami, 1996.
(Additional optional reading: Feature Selection for a Rich HPSG Grammar Using Decision Trees. Chris Manning, 2002. Boosting and maximum likelihood for exponential models. Guy Lebanon and John Lafferty, 2002.)
Point Presentations:
Top-10: Joshua
Top-10b: Peter
Project Proposals: Khash, Brent, Pippin, David, Alvaro, Aron

# 8
March 21
Spring Break

# 9
March 28
Finite State Structure Induction & Factorial Markov Models
Inducing Probabilistic Grammars by Bayesian Model Merging. A. Stolcke and S. Omohundro, 1994.
Factorial hidden Markov models. Z. Ghahramani, M. Jordan, 1995.
(Additional optional reading: Information Extraction with HMM Structures Learned by Stochastic Optimization. Dayne Freitag and Andrew McCallum, 2000. Probabilistic DFA Inference using Kullback-Leibler Divergence and Minimality. F. Thollard, P. Dupont, C. Higuera. A Coupled HMM for Audio-Visual Speech Recognition. A. Nefian, et al., 2002. Audio-Visual Sound Separation Via Hidden Markov Models. John Hershey and Michael Casey, 2001. Structure learning in conditional probability models via an entropic prior and parameter extinction. Matt Brand. Learning Hidden Markov Model Structure for Information Extraction. K. Seymore, et al., 1999. Factorial Markov Random Fields. J. Kim and R. Zabih, 2002.)
Point presentations:
Top-10: Alvaro
Introduction to factorial finite state machines: Khash
Overview of Hershey and Casey: Jerod
Introduction to Bayesian Model Merging: Andrew
Overview of Seymore et al.: Andy
Project Proposal: Jen

# 10
April 4
Parsing and IE (Andrew out of town)
Three Generative, Lexicalised Models for Statistical Parsing. Michael Collins, 1997.
A Novel Use of Statistical Parsing to Extract Information from Text. Scott Miller et al., 2000.
(Additional optional reading: Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. Riezler, et al., 2002.)
Point Presentations:
Introduction to PCFG parsing & inside-outside algorithm: Brent Heeringa
Collins paper: Vanessa
Miller paper: Wei
Riezler paper: Brent?
Top-10: Peter

# 11
April 11
Reference-Matching, Co-reference, Identity Uncertainty and other Relations
Probabilistic Reasoning for Entity & Relation Recognition. D. Roth and W. Yih, 2002.
Unpublished paper on relational models of IE.
(Additional optional reading: Representing Sentence Structure in Hidden Markov Models for Information Extraction. Mark Craven, 2001. Identity Uncertainty. Stuart Russell, 2001. Coreference for NLP Applications. Thomas Morton, 2000. Learning to Match and Cluster Entity Names. Cohen and Richman, 2001. Identity Uncertainty and Citation Matching. Pasula et al., 2002.)
Point Presentations:
Top-10: ________________

# 12
April 18
Semi-supervised Learning for IE
Unsupervised Models for Named Entity Classification. Michael Collins and Yoram Singer, 1999.
Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Ellen Riloff and Rosie Jones.
Combining Labeled and Unlabeled Data with Co-Training. A. Blum and T. Mitchell, 1998.
Learning with labeled and unlabeled data. M. Seeger, 2001.
Text Classification from Labeled and Unlabeled Documents. K. Nigam et al., 1999.
Information regularization with partially labeled data. M. Szummer and T. Jaakkola, 2002.
Learning with Scope, with Application to Information Extraction and Classification. D. Blei, et al., 2001.
Latent Dirichlet Allocation.
An Introduction to Variational Methods for Graphical Models. M. Jordan et al., 1998.
Point Presentations:
Top-10: _________
Introduction to Variational Methods: Andrew
Introduction to Co-training: Jen
Overview of Szummer & Jaakkola: _________
Overview of learning with labeled and unlabeled data (Seeger paper): Wei?
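A minimal sketch of the co-training loop of Blum & Mitchell: two independent "views" of each example bootstrap each other from a tiny seed set. Here the views are a spelling feature and a context feature (echoing Collins & Singer's named-entity setting), the "classifiers" are trivial feature-to-label lookup tables, and all data are invented.

```python
# Seed examples: ((spelling feature, context feature), label); all invented.
seed = [(("york", "in"), "LOC"), (("smith", "mr"), "PER")]
unlabeled = [("york", "near"), ("smith", "dr"), ("paris", "in"),
             ("jones", "mr"), ("paris", "near"), ("jones", "dr")]

labeled = {x: y for x, y in seed}   # example -> label

for _ in range(3):                  # a few co-training rounds
    # "Train" one trivial classifier per view: feature -> label lookup.
    rules = [dict(), dict()]
    for (v1, v2), y in labeled.items():
        rules[0][v1], rules[1][v2] = y, y
    # Each view labels the unlabeled examples whose feature it recognizes.
    for x in unlabeled:
        if x not in labeled:
            for view, r in enumerate(rules):
                if x[view] in r:
                    labeled[x] = r[x[view]]
                    break

print({x: labeled.get(x) for x in unlabeled})
```

Note how ("paris", "near") is reachable only in two hops: the context view labels ("paris", "in") first, which teaches the spelling view about "paris". This mutual teaching across conditionally independent views is the core idea of co-training.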

# 13
April 25
Project Presentations

# 14
May 2
Project Presentations

# 15
May 9
Project Presentations ...and wrap-up

**SVMs and Kernel Methods**

Maximum entropy discrimination. T. Jaakkola, M. Meila, and T. Jebara, 1999.

String Matching Kernels for Text Classification. H. Lodhi, C. Saunders, N. Cristianini, C. Watkins, J. Shawe-Taylor

(Additional optional reading:

Text Categorization with Support Vector Machines. Thorsten Joachims. 1998.

Some SVM IE paper, Gaussian Processes)

Point presentations:

Top-10: ________

SVM overview: _____________

Connections between MaxEnt and SVMs: ____________

Explanation of string kernels: _____________

**Integration of IE with Data Mining**

Ray Mooney paper

Dan Roth paper

**Wrapper Induction and Multi-modal IE**

Boosted Wrapper Induction. Kushmerick and Freitag.

LDA model of images and captions. Blei and others.

Something from the Informedia project at CMU.