Andrew McCallum UMass logo


The goal of my current research is to dramatical increase our ability to mine actionable knowledge from unstructured text. I am especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature & community.

Toward this end my group and I develop and employ various methods in statistical machine learning, natural language processing and information retrieval. We tend toward probabilistic approaches, graphical models, and Bayesian methods. Methodologically, our work over the past several divides into two families: (1) conditionally-trained undirected graphical models (conditional random fields)---not just for finite-state sequence modeling, but also more complex relational domains including coreference, alignment, schema matching, and various relational domains. (2) Bayesian generative latent variable models, including "topic models," and "structured topic models", which jointly model text with other structured meta-data. We have been applying these topic models to language modeling, data mining textual databases, and social network analysis.

Unified Information Extraction and Data Mining

Although information extraction and data mining appear together in many applications, their interface in most current deployments would better be described as serial juxtaposition than as tight integration. Information extraction populates slots in a database by identifying relevant subsequences of text, but is usually not aware of the emerging patterns and regularities in the database. Data mining methods begin from a populated database, and are often unaware of where the data came from, or its inherent uncertainties. The result is that the accuracy of both suffers, and significant mining of complex text sources is beyond reach.

We have been researching relational probabilistic models that unify extraction and mining, so that by sharing common inference procedures, they can each overcome the weaknesses of the other. For example, data mining run on a partially-filled database can find patterns that provide "top-down" accuracy-improving constraints to information extraction. Information extraction can provide a much richer set of "bottom-up" hypotheses to data mining if the mining is able to handle additional uncertainty information from extraction.

Conditional Random Fields for Relational Data, Approximate Inference and Learning

The above unified processing requires large-scala joint inference that cannot be performed exactly. We have been developing various methods of MCMC inference methods and corresponding learning approaches aimed specifically at extremely large relational-data domains. Our approach based on Metropolis-Hastings inference and learning by ranking, achieved best-in-the-world coreference resolution on a standard newswire dataset. This work is also quite relevant to recent interest in "combining logic and probability".

Automatic Knowledge Base Construction, with Probabilistic Databases & Human Cooperation

Incorporating uncertainty and probabilistic inference into databases has posed many challenges, often leading to systems that sacrifice modeling power, scalability, or restrict the class of relational algebra formulae under which they are closed.  We have proposed [Wick et al, VLDB, 2009] an alternative approach in which the underlying relational database always represents a single possible world, and an external, unrestricted factor graph encodes a distribution over the set of possible worlds; Markov chain Monte Carlo (MCMC) inference is then used to recover this uncertainty to a desired level of fidelity.  Our approach allows the efficient evaluation of arbitrary SQL queries over probabilistic databases with dependencies expressed by unrestricted graphical models, (including graphical model structure that changes during inference).

As a result, state-of-the-art, machine-learning-based information extraction, entity resolution, schema-aligment can then be efficiently run "inside" the database.  Furthermore, rather than running in pipeline fashion, they can all interoperate in the same scalable infrastructure, imparting the increased accuracy of joint inference. This also provides a convenient any-time functionality to KB construction and query processing.

In addition, we advocate an approach to information integration (including knowledge-base construction by human-machine cooperation) in which the the canonical "true" entities and relations in the database are always inferred from integrated or human-entered data, never injected directly.  Even human "corrections" to the KB are merely additional pieces of evidence input as probabilistic evidence---allowing our system to reason probabilistically about provenance and trust.

MCMC sampling provides efficiency by hypothesizing modifications to possible worlds rather than generating entire worlds from scratch.  Queries are then run over the portions of the world that change, avoiding the onerous cost of running full queries over each sampled world.  We leverage materialzed view maintenance techniques to dramatically speed query processing.  We demonstrate system’s ability to answer relational queries with aggregation over one million tuples, and show additional scalability through use of parallelization.

Extraction, Integration and Mining of Bibliographic Data

Back in the 1997 I conceived of and lead a project at JustResearch that created Cora, one of the first domain-specific search engines over computer science research papers. You can read more about our research on Cora in our IRJ journal paper or a paper presented at the AAAI'99 Spring Symposium. The Cora team also included Kamal Nigam, Kristie Seymore, Jason Rennie, Huan Chang and Jason Reed.

More recently we have been working on an enhanced alternative to Google Scholar, CiteSeer, and other digital libraries of the research literature. Our system, called Rexa, automatically extracts a de-duplicated cross-referenced database of not just papers (and references), but also people, grants, publication venues and institutions. We also perform various kinds of topic and bibliometric impact analysis on this data.

Social Network Analysis with Structured Topic Models

Traditional social network analysis examines the connectivity of entities in a graph. However, in many cases we have data not just about the existence of a graph-edge, but also various properties of the nodes and edges---including large quantities of corresponding textual data. We have used Bayesian latent variable models, variants of "topic models" augmented with non-textual variables to (a) discover roles of people in the sender-receiver structure of a large email collection, (b) discover groups (coalitions) of U.S. senators or U.N. countries from their voting records and the topics of the bills, (c) discover communities of academic researchers from their papers and the venues in which they publish.

Semi-supervised Learning & Alignment Learning in Natural Language

The only way to put natural language learning into the hands of the people is to reduce the burden of labeling training data. Over the years we have worked on various methods of semi-supervised learning that combines small amounts of labeled data with large amounts of unlabeled data. Our most recent work is in Generalized Expectation (GE) criteria, one form of which can be understood as enabling "feature labeling" as opposed to the traditional "instance labeling".

We have also removed the need for human labeled data entirely by leveraging information already in relevant databases, and learning information extractors by discovering CRF-based alignments between database records and unstructured text.

Joint Inference for NLP, Dialog Pragmatics, Perception and Action

As part of a MURI project joint with UPenn, we have begun work on probabilistic modeling of natural language dialog, designing methods that will do joint, unified inference all the way from natural language understanding, through dialog pragmatics, to perception and action in a shared world. This work will leverage our research in large-scale joint inference in CRFs.

Intelligent Understanding of our Email World

As part of the CALO project, we extracted information about people and other entities appearing in email streams, performed large-scale entity resolution, and discovered topics and expertise.

Conditional Probability Models for Sequences and other Relational Data

Back in the 1990's, after having some success using hidden Markov models for information extraction, we found ourselves frustrated by their lack of ability to incorporate many arbitrary, overlapping features of the input sequence, such as capitalization, lexicon memberships, spelling features, and conjunctions of such features in a large window of past and future observations. The same difficulties with non-independent features exist in many generatively-trained models historically used in NLP. We have begun work with conditionally-trained probability models that address these problems. Maximum entropy Markov models are locally-normalized conditional sequence models. Finite-state Conditional Random Fields (CRFs) are globally-normalized models. We have also been working with CRFs for coreference and multi-sequence labeling, analogous to conditionally-trained Dynamic Bayesian Networks (DBNs). We now work with even more complex CRFs, that integrate logic and probability, as described above.

WhizBang Labs

From 2000 through 2002 I was Vice President of Research and Development at WhizBang Labs, a start-up company focusing on information extraction from the Web. We developed sophisticated machine learning extraction systems for numerous application domains---among them, a database of job openings extracted directly from company Web sites (now owned by, corporate information for Dun & Bradstreet and Lexis Nexis, and course syllabi for the U.S. Department of Labor.


In 1996 and 1997 I was part of Tom Mitchell's WebKB project and the CMU Text Learning group.

Reinforcement Learning

In what now seems like a lifetime ago, I was interested in reinforcement learning---especially with hidden state and factored representations. My thesis uses memory-based learning and a robust statistical test on reward in order to learn a structured policy representation that makes perceptual and memory distinctions only where needed for the task at hand. It can also be understood as a method of Value Function Approximation. The model learned is an order-n partially observable Markov decision process. It handles noisy observation, action and reward.

It is related to Ron, Singer and Tishby's Probabilistic Suffix Trees, Leslie Kaelbling's G-algorithm and Andrew Moore's Parti-game. It is distinguished from similar-era work by Michael Littman, Craig Boutilier and others in that it learns both a model and a policy, and is quite practical with infinite-horizon tasks and large state and observation spaces. Follow-on or comparison work has been done by Anders Jonsson, Andy Barto, Will Uther, Natalia Hernandez, Leslie Kaelbling, and Sridhar Mahadevan.

The algorithm, called U-Tree, was demonstrated solving a highway driving task using simulated eye-movements and deictic representations. The simulated environment has about 21000 states, 2500 observations, noise and much hidden state. After about 2 1/2 hours of simulated experience, U-Tree learns a task-specific model of the environment that has only 143 states. It's learned behavior included lane changes to avoid slow vehicles in front, and checking the rear-view mirror to avoid faster vehicles from behind.