Andrew McCallum

Research and Projects

The goal of my current research is to dramatically increase our ability to mine actionable knowledge from unstructured text. I am especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature & community.

Toward this end, my group and I develop and employ various methods in statistical machine learning, natural language processing and information retrieval. We tend toward probabilistic approaches, graphical models, and Bayesian methods. Methodologically, our work over the past several years divides into two families: (1) conditionally-trained undirected graphical models (conditional random fields), not just for finite-state sequence modeling, but also for more complex relational domains including coreference, alignment, and schema matching; (2) Bayesian generative latent variable models, including "topic models" and "structured topic models", which jointly model text with other structured meta-data. We have been applying these topic models to language modeling, data mining textual databases, and social network analysis.

Unified Information Extraction and Data Mining

Although information extraction and data mining appear together in many applications, their interface in most current deployments would better be described as serial juxtaposition than as tight integration. Information extraction populates slots in a database by identifying relevant subsequences of text, but is usually not aware of the emerging patterns and regularities in the database. Data mining methods begin from a populated database, and are often unaware of where the data came from, or its inherent uncertainties. The result is that the accuracy of both suffers, and significant mining of complex text sources is beyond reach.

We have been researching relational probabilistic models that unify extraction and mining, so that by sharing common inference procedures, they can each overcome the weaknesses of the other. For example, data mining run on a partially-filled database can find patterns that provide "top-down" accuracy-improving constraints to information extraction. Information extraction can provide a much richer set of "bottom-up" hypotheses to data mining if the mining is able to handle additional uncertainty information from extraction.

Conditional Random Fields for Relational Data, Approximate Inference and Learning

The above unified processing requires large-scale joint inference that cannot be performed exactly. We have been developing various MCMC inference methods and corresponding learning approaches aimed specifically at extremely large relational-data domains. Our approach, based on Metropolis-Hastings inference and learning by ranking, achieved best-in-the-world coreference resolution on a standard newswire dataset. This work is also quite relevant to recent interest in "combining logic and probability".
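
As a rough illustration of Metropolis-Hastings inference for coreference, the following Python sketch samples over assignments of mentions to clusters. It is a toy under stated assumptions, not the system described above: the mention ids, the pairwise `affinity` weights, the default penalty, and the single-mention relabeling proposal are all hypothetical, and the learning-by-ranking step is not shown.

```python
import math
import random
from itertools import combinations


def log_score(labels, affinity, default=-2.0):
    """Unnormalized log-score of a clustering: the sum of pairwise
    affinities over all mention pairs that share a cluster label.
    Pairs missing from `affinity` get a hypothetical default penalty."""
    return sum(
        affinity.get((a, b), default)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    )


def metropolis_hastings(mentions, affinity, steps=2000, seed=0):
    """Sample cluster labelings and return the best-scoring one seen."""
    rng = random.Random(seed)
    labels = {m: i for i, m in enumerate(mentions)}  # start from singletons
    current = log_score(labels, affinity)
    best, best_score = dict(labels), current
    for _ in range(steps):
        # Symmetric proposal: relabel one uniformly chosen mention with a
        # uniformly chosen cluster id, so the proposal ratio cancels.
        mention = rng.choice(mentions)
        old_label = labels[mention]
        labels[mention] = rng.randrange(len(mentions))
        proposed = log_score(labels, affinity)
        # Accept with probability min(1, exp(proposed - current)).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
            if current > best_score:
                best, best_score = dict(labels), current
        else:
            labels[mention] = old_label  # reject: restore the previous state
    return best, best_score


# Hypothetical toy data: m1/m2 and m3/m4 are likely coreferent pairs.
MENTIONS = ["m1", "m2", "m3", "m4"]
AFFINITY = {("m1", "m2"): 2.0, ("m3", "m4"): 2.0}
best, best_score = metropolis_hastings(MENTIONS, AFFINITY)
```

This sketch rescores the whole clustering at every step for clarity. At the scale discussed here, one would presumably compute only the change in score caused by the proposed move.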

Extraction, Integration and Mining of Bibliographic Data

Back in the 1990s I led the project at JustResearch that created Cora, a domain-specific search engine over computer science research papers. It currently contains over 50,000 PostScript papers. You can read more about our research on Cora in our IRJ journal paper or a paper presented at the AAAI'99 Spring Symposium. The Cora team also included Kamal Nigam, Kristie Seymore, Jason Rennie, Huan Chang and Jason Reed.

More recently we have been working on an enhanced alternative to Google Scholar, CiteSeer, and other digital libraries of the research literature. Our system, called Rexa, automatically extracts a de-duplicated, cross-referenced database of not just papers (and references), but also people, grants, publication venues and institutions. We also perform various kinds of topic and bibliometric impact analysis on this data.

Social Network Analysis with Structured Topic Models

Traditional social network analysis examines the connectivity of entities in a graph. However, in many cases we have data not just about the existence of a graph edge, but also about various properties of the nodes and edges, including large quantities of corresponding textual data. We have used Bayesian latent variable models, variants of "topic models" augmented with non-textual variables, to (a) discover roles of people in the sender-receiver structure of a large email collection, (b) discover groups (coalitions) of U.S. senators or U.N. countries from their voting records and the topics of the bills, and (c) discover communities of academic researchers from their papers and the venues in which they publish.

Semi-supervised Learning & Alignment Learning in Natural Language

The only way to put natural language learning into the hands of the people is to reduce the burden of labeling training data. Over the years we have worked on various methods of semi-supervised learning that combine small amounts of labeled data with large amounts of unlabeled data. Our most recent work is in Generalized Expectation (GE) criteria, one form of which can be understood as enabling "feature labeling" as opposed to the traditional "instance labeling".

We have also removed the need for human labeled data entirely by leveraging information already in relevant databases, and learning information extractors by discovering CRF-based alignments between database records and unstructured text.

Joint Inference for NLP, Dialog Pragmatics, Perception and Action

As part of a MURI project joint with UPenn, we have begun work on probabilistic modeling of natural language dialog, designing methods that will do joint, unified inference all the way from natural language understanding, through dialog pragmatics, to perception and action in a shared world. This work will leverage our research in large-scale joint inference in CRFs.

Intelligent Understanding of our Email World

As part of the CALO project, we extracted information about people and other entities appearing in email streams, performed large-scale entity resolution, and discovered topics and expertise.

Conditional Probability Models for Sequences and other Relational Data

Back in the 1990s, after having some success using hidden Markov models for information extraction, we found ourselves frustrated by their inability to incorporate many arbitrary, overlapping features of the input sequence, such as capitalization, lexicon memberships, spelling features, and conjunctions of such features in a large window of past and future observations. The same difficulties with non-independent features exist in many generatively-trained models historically used in NLP. We began working with conditionally-trained probability models that address these problems. Maximum entropy Markov models are locally-normalized conditional sequence models. Finite-state Conditional Random Fields (CRFs) are globally-normalized models. We have also been working with CRFs for coreference and multi-sequence labeling, analogous to conditionally-trained Dynamic Bayesian Networks (DBNs). We now work with even more complex CRFs that integrate logic and probability, as described above.
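
The difference between local and global normalization can be made concrete with a small Python sketch. It is only an illustration: the two-tag set, the feature weights in `local_score`, and the brute-force enumeration of tag sequences are assumptions chosen for readability, whereas a practical CRF would compute the normalizer with dynamic programming. Both models share the same overlapping features (capitalization plus the previous tag); they differ only in where the normalization happens.

```python
import math
from itertools import product

TAGS = ("O", "NAME")  # hypothetical tag set


def local_score(prev_tag, tag, word):
    """Hypothetical weights on overlapping features of the input."""
    score = 0.0
    if word[0].isupper() and tag == "NAME":
        score += 2.0  # capitalized words tend to be names
    if not word[0].isupper() and tag == "O":
        score += 1.0  # lowercase words tend to be non-names
    if prev_tag == "NAME" and tag == "NAME":
        score += 0.5  # names often span several tokens
    return score


def sequence_score(tags, words):
    """Sum of local scores along one complete tag sequence."""
    total, prev = 0.0, "START"
    for tag, word in zip(tags, words):
        total += local_score(prev, tag, word)
        prev = tag
    return total


def memm_prob(tags, words):
    """Locally normalized: a softmax over tags at every position."""
    prob, prev = 1.0, "START"
    for tag, word in zip(tags, words):
        z = sum(math.exp(local_score(prev, t, word)) for t in TAGS)
        prob *= math.exp(local_score(prev, tag, word)) / z
        prev = tag
    return prob


def crf_prob(tags, words):
    """Globally normalized: one normalizer over all tag sequences."""
    z = sum(
        math.exp(sequence_score(candidate, words))
        for candidate in product(TAGS, repeat=len(words))
    )
    return math.exp(sequence_score(tags, words)) / z
```

Both functions define proper distributions over tag sequences, but they generally assign different probabilities to the same sequence, because the locally normalized model must spend all of its probability mass at each step regardless of what follows.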

WhizBang Labs

From 2000 through 2002 I was Vice President of Research and Development at WhizBang Labs, a start-up company focusing on information extraction from the Web. We developed sophisticated machine learning extraction systems for numerous application domains---among them, a database of job openings extracted directly from company Web sites (now owned by, corporate information for Dun & Bradstreet and Lexis Nexis, and course syllabi for the U.S. Department of Labor.


In 1996 and 1997 I was part of Tom Mitchell's WebKB project and the CMU Text Learning group.

Reinforcement Learning

In what now seems like a lifetime ago, I was interested in reinforcement learning---especially with hidden state and factored representations. My thesis uses memory-based learning and a robust statistical test on reward in order to learn a structured policy representation that makes perceptual and memory distinctions only where needed for the task at hand. It can also be understood as a method of Value Function Approximation. The model learned is an order-n partially observable Markov decision process. It handles noisy observation, action and reward.

It is related to Ron, Singer and Tishby's Probabilistic Suffix Trees, Leslie Kaelbling's G-algorithm and Andrew Moore's Parti-game. It is distinguished from similar-era work by Michael Littman, Craig Boutilier and others in that it learns both a model and a policy, and is quite practical with infinite-horizon tasks and large state and observation spaces. Follow-on or comparison work has been done by Anders Jonsson, Andy Barto, Will Uther, Natalia Hernandez, Leslie Kaelbling, and Sridhar Mahadevan.

The algorithm, called U-Tree, was demonstrated solving a highway driving task using simulated eye-movements and deictic representations. The simulated environment has about 21000 states, 2500 observations, noise and much hidden state. After about 2 1/2 hours of simulated experience, U-Tree learns a task-specific model of the environment that has only 143 states. Its learned behavior included lane changes to avoid slow vehicles in front, and checking the rear-view mirror to avoid faster vehicles from behind.
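
The idea of making a distinction only where a statistical test on reward says it matters can be sketched in Python. This is a minimal sketch under assumptions, not the thesis implementation: the text above does not name the test, so a two-sample Kolmogorov-Smirnov statistic is assumed here as one robust choice, and the function names, the sample data layout, and the 1.36 coefficient (a common large-sample approximation for roughly the 5% level) are all illustrative.

```python
import math
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples."""
    xs, ys = sorted(xs), sorted(ys)
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in xs + ys
    )


def should_split(returns_a, returns_b, coefficient=1.36):
    """Decide whether a candidate distinction is worth keeping.

    `returns_a` and `returns_b` are the returns observed on the two sides
    of the candidate distinction (for example, a perceptual feature being
    on or off). The distinction is kept only when the two return
    distributions differ by more than the approximate critical value.
    """
    n, m = len(returns_a), len(returns_b)
    critical = coefficient * math.sqrt((n + m) / (n * m))
    return ks_statistic(returns_a, returns_b) > critical
```

With eight samples per side the critical value is 0.68, so two sides with clearly separated returns would be split, while two sides with identical returns would be left merged.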