Andrew McCallum


The main goal of my research is to dramatically increase our ability to mine actionable knowledge from unstructured text. I am especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature and community. Toward this end, my group develops and employs various methods in statistical machine learning, natural language processing, information retrieval and data mining, tending toward probabilistic approaches and graphical models. For more information see our current projects and publications.


  • We are building an "open reviewing" system for ICLR 2013 and other venues.  If you are interested in alternative approaches to peer review, please talk with me!
  • FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.
  • I was the General Chair of ICML 2012, with Program Chairs Joelle Pineau and John Langford.
  • Generalized Expectation is an accurate way to train models by labeling features.
  • We have publicly launched Rexa, a new research paper search engine. It is a sibling to CiteSeer and Google Scholar, except that it provides search and browsing over more "object types", including not just papers, but also people, grants and topics.
  • Charles Sutton and I have a comprehensive introduction to conditional random fields now published by Foundations and Trends in Machine Learning.
  • I've written an introduction to information extraction by machine learning, intended for an audience that doesn't know machine learning. Information Extraction: Distilling Structured Data from Unstructured Text. Andrew McCallum. ACM Queue, Volume 3, Number 9, November 2005.
  • MALLET is a Java toolkit for machine learning applied to natural language. It provides facilities for document classification, information extraction, part-of-speech tagging, noun phrase segmentation, general finite state transducers and classification, and much more, all designed to be extremely efficient for large data and feature sets. Although MALLET is quite mature in functionality, its documentation is still sparse.
  • An analysis of topical trends in the five years of ICML before 2008.
  • Three of my papers made it into CiteSeer's list of most cited computer science papers.