The goal of my research is to dramatically increase our
ability to mine actionable knowledge from unstructured text. I am
especially interested in
applications to the scientific literature, building knowledge bases of science, revolutionizing peer review, and other knowledge-based approaches to accelerate scientific progress.
Toward this end, my group and I do research in machine learning and natural language processing, information extraction, knowledge base construction, crowd-sourcing, social network analysis, expertise modeling, knowledge representation, and reasoning. We enjoy developing new applications as well as novel machine learning methodology in both probabilistic graphical models and deep neural networks.
Some current projects
See a larger list of projects on my lab's project page
(Older Research Descriptions Below)
Unified Information Extraction and Data Mining
Although information extraction and data mining appear
together in many applications, their interface in most current
deployments would better be described as serial juxtaposition than as
tight integration. Information extraction populates slots in a database
by identifying relevant subsequences of text, but is usually not aware
of the emerging patterns and regularities in the database. Data mining
methods begin from a populated database, and are often unaware of where
the data came from, or its inherent uncertainties. The result is that
the accuracy of both suffers, and significant mining of complex text
sources is beyond reach.
We have been researching relational probabilistic models that unify
extraction and mining, so that by sharing common inference
procedures, they can each overcome the weaknesses of the other. For
example, data mining run on a partially-filled database can find
patterns that provide "top-down" accuracy-improving constraints to
information extraction. Information extraction can provide a much
richer set of "bottom-up" hypotheses to data mining if the mining is
able to handle additional uncertainty information from extraction.
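As a cartoon of this interaction (the field names, confidences, and re-weighting rule below are all invented for illustration, not our actual models): extraction emits uncertain candidate records, a mining pass over the partially filled database finds the dominant pattern, and that pattern flows back down to re-score the extraction hypotheses.

from collections import Counter

# Bottom-up: each candidate record carries its extraction confidence.
candidates = [
    {"venue": "ICML", "year": 2003, "p": 0.9},
    {"venue": "ICML", "year": 2003, "p": 0.8},
    {"venue": "IMCL", "year": 2003, "p": 0.6},   # likely an OCR error
]

# "Mining" step: find the dominant venue string in the partially filled
# database, weighted by extraction confidence.
venue_support = Counter()
for r in candidates:
    venue_support[r["venue"]] += r["p"]
best_venue, _ = venue_support.most_common(1)[0]

# Top-down step: boost hypotheses that agree with the mined pattern,
# penalize those that do not, then renormalize.
for r in candidates:
    r["p"] *= 1.5 if r["venue"] == best_venue else 0.5
z = sum(r["p"] for r in candidates)
for r in candidates:
    r["p"] /= z

print(candidates)   # the "IMCL" hypothesis is now down-weighted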
Conditional Random Fields for
Relational Data, Approximate Inference and Learning
The above unified processing requires large-scale joint
inference that cannot be performed exactly. We have been developing
various MCMC inference methods and corresponding learning
approaches aimed specifically at extremely large relational-data
domains. Our approach, based on Metropolis-Hastings inference and
learning by ranking, achieved best-in-the-world
coreference resolution on a standard newswire dataset. This work is
also quite relevant to recent interest in "combining logic and probability."
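To make the flavor of this concrete, here is a minimal Metropolis-Hastings sketch over coreference clusterings (toy mentions and a toy compatibility score, not our trained factor graphs or ranking-based learning): each proposal moves one mention to another cluster, and acceptance depends only on the change in model score.

import math, random

random.seed(0)

mentions = ["Mr. Smith", "Smith", "John Smith", "Ms. Jones", "Jones"]

def compatibility(a, b):
    # Cheap pairwise affinity: reward shared tokens, penalize none.
    return 1.0 if set(a.split()) & set(b.split()) else -1.0

def score(assign):
    # Sum pairwise compatibility within clusters -- a tiny stand-in for the
    # log-score of a factor graph over coreference decisions.
    s = 0.0
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if assign[i] == assign[j]:
                s += compatibility(mentions[i], mentions[j])
    return s

assign = list(range(len(mentions)))   # start: every mention in its own cluster

for _ in range(2000):
    proposal = assign[:]
    i = random.randrange(len(mentions))
    proposal[i] = random.choice(assign)                  # move mention i into an existing cluster
    delta = score(proposal) - score(assign)
    if delta >= 0 or random.random() < math.exp(delta):  # Metropolis acceptance
        assign = proposal

print(assign)   # mentions sharing a surname end up in the same cluster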
Automatic Knowledge Base Construction, with Probabilistic Databases & Human Cooperation
Incorporating uncertainty and
probabilistic inference into databases has posed many challenges, often
leading to systems that sacrifice modeling power or scalability, or that
restrict the class of relational algebra formulae under which they are
closed. We have proposed [Wick et al, VLDB, 2009] an alternative
approach in which the underlying relational database always represents
a single possible world, and an external, unrestricted factor graph
encodes a distribution over the set of possible worlds; Markov chain
Monte Carlo (MCMC) inference is then used to recover this uncertainty
to a desired level of fidelity. Our approach allows the efficient
evaluation of arbitrary SQL queries over probabilistic databases with
dependencies expressed by unrestricted graphical models (including
graphical model structure that changes during inference).
As a result, state-of-the-art, machine-learning-based information
extraction, entity resolution, and schema alignment can then be efficiently
run "inside" the database. Furthermore, rather than running in
pipeline fashion, they can all interoperate in the same scalable
infrastructure, imparting the increased accuracy of joint inference.
This also provides a convenient any-time functionality to KB
construction and query processing.
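A minimal sketch of that arrangement, assuming a single uncertain boolean fact and a one-factor model (everything here is invented for illustration; the real system is far richer):

import math, random

random.seed(0)

# The database always holds ONE concrete possible world; here that world is a
# single uncertain bit: are "Proc. ICML" and "ICML" the same venue?
world = {"v1_equals_v2": False}

def log_score(w):
    # One-factor "graph": string similarity weakly favors merging the venues.
    return 1.2 if w["v1_equals_v2"] else 0.0

def sql_count_distinct_venues(w):
    # Stand-in for an arbitrary SQL aggregate run against the current world.
    return 1 if w["v1_equals_v2"] else 2

samples = []
for _ in range(10000):
    proposal = dict(world, v1_equals_v2=not world["v1_equals_v2"])   # flip the bit
    delta = log_score(proposal) - log_score(world)
    if delta >= 0 or random.random() < math.exp(delta):
        world = proposal
    samples.append(sql_count_distinct_venues(world))

# The query answer is an expectation over sampled worlds; more samples buy
# more fidelity (the any-time behavior mentioned above).
print(sum(samples) / len(samples))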
In addition, we advocate an approach to information integration
(including knowledge-base construction by human-machine cooperation) in
which the canonical "true" entities and relations in the database
are always inferred from integrated or human-entered data, never
injected directly. Even human "corrections" to the KB are merely
additional pieces of evidence input as probabilistic
evidence---allowing our system to reason probabilistically about
provenance and trust.
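As a toy illustration of that design (variable names and weights invented, echoing the sketch above): a human assertion enters as one more factor over the same uncertain variable, scaled by how much we trust the annotator, rather than overwriting the stored value.

# A human "correction" is evidence with a trust weight, not a hard overwrite.
human_says_same = True     # the human asserts the two venues are the same
annotator_trust = 2.0      # per-annotator reliability weight (invented)

def log_score_with_evidence(world):
    s = 1.2 if world["v1_equals_v2"] else 0.0        # original model factor
    if world["v1_equals_v2"] == human_says_same:     # human-evidence factor
        s += annotator_trust
    return s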
MCMC sampling provides efficiency by hypothesizing modifications to
possible worlds rather than generating entire worlds from
scratch. Queries are then run over the portions of the world that
change, avoiding the onerous cost of running full queries over each
sampled world. We leverage materialized view maintenance
techniques to dramatically speed query processing. We demonstrate the
system's ability to answer relational queries with aggregation over one
million tuples, and show additional scalability through use of parallelization.
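The gain from maintaining the query result rather than recomputing it shows up even in a toy sketch (synthetic data, with the model's accept/reject step elided):

import random

random.seed(0)

# Toy "materialized view": maintain SELECT SUM(citations) incrementally while
# MCMC proposes changes to single tuples, instead of re-running the full
# query against every sampled world.
citations = [random.randint(0, 50) for _ in range(1_000_000)]
view_sum = sum(citations)                       # computed once, then maintained

for _ in range(1000):
    i = random.randrange(len(citations))
    new_value = random.randint(0, 50)           # an MCMC proposal touching one tuple
    # (the accept/reject decision against the model score would go here)
    view_sum += new_value - citations[i]        # O(1) view maintenance
    citations[i] = new_value

assert view_sum == sum(citations)               # matches a full recount
print(view_sum)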
Extraction, Integration and Mining of the Research Literature
Back in 1997 I conceived of and led a project at
JustResearch that created Cora, one of the first domain-specific search engines over computer science research papers.
You can read more
about our research on Cora in our IRJ journal
paper or a paper
presented at the AAAI'99
Spring Symposium. The Cora team also included Kamal Nigam, Kristie Seymore, Jason Rennie, Huan
Chang and Jason Reed.
More recently we have been working on an enhanced
alternative to Google Scholar,
CiteSeer, and other
digital libraries of the research literature. Our system, called Rexa, automatically extracts a
de-duplicated cross-referenced database of not just papers (and
references), but also people, grants, publication venues
and institutions. We also perform various kinds of topic
impact analysis on this data.
Social Network Analysis with Structured Topic Models
Traditional social network analysis examines the
connectivity of entities in a graph. However, in many cases we have
data not just about the existence of a graph-edge, but also various
properties of the nodes and edges---including large quantities of
corresponding textual data. We have used Bayesian latent variable
models (variants of "topic models" augmented with non-textual variables)
to (a) discover
roles of people in the sender-receiver structure of a large email
collection, (b) discover
groups (coalitions) of U.S. senators or U.N. countries from their
voting records and the topics of the bills, (c) discover
communities of academic researchers from their papers and the
venues in which they publish.
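For intuition about what "augmented with non-textual variables" means, here is a toy generative sketch (invented vocabulary, people, and probabilities, and far simpler than models such as the Author-Recipient-Topic model): the topic of each word depends on who is writing to whom, and inference runs this story in reverse to recover roles and groups.

import random

random.seed(0)

topics = {
    "budget": ["budget", "invoice", "quarter", "cost"],
    "hiring": ["interview", "candidate", "offer", "resume"],
}
# The non-textual variable: sender-recipient pairs have their own topic mixes.
pair_topic_prefs = {
    ("alice", "bob"):   {"budget": 0.8, "hiring": 0.2},
    ("alice", "carol"): {"budget": 0.1, "hiring": 0.9},
}

def generate_email(sender, recipient, length=6):
    prefs = pair_topic_prefs[(sender, recipient)]
    words = []
    for _ in range(length):
        topic = random.choices(list(prefs), weights=list(prefs.values()))[0]
        words.append(random.choice(topics[topic]))
    return words

print(generate_email("alice", "bob"))     # mostly "budget" vocabulary
print(generate_email("alice", "carol"))   # mostly "hiring" vocabulary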
Semi-supervised Learning &
Alignment Learning in Natural Language
The only way to put natural language learning into the hands
of the people is to reduce the burden of labeling training data. Over
the years we have worked on various methods of semi-supervised learning
that combine small amounts of labeled data with large amounts of
unlabeled data. Our most recent work is in Generalized
Expectation (GE) criteria, one form of which can be understood as
labeling" as opposed to the traditional "instance labeling".
We have also removed the need for human-labeled data
entirely by leveraging information already in relevant databases, and
learning information extractors by discovering CRF-based
alignments between database
records and unstructured text.
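As a worked miniature of the GE "feature labeling" idea above (the label set, target distribution, and numbers are invented): we assert that tokens bearing a certain feature should receive a particular label about 90% of the time, and add a penalty whenever the model's expectation on unlabeled data disagrees.

import numpy as np

target = np.array([0.9, 0.1])             # desired P(label | feature fires): [NUMBER, OTHER]

def ge_penalty(model_probs):
    # KL(target || model expectation) over unlabeled tokens where the feature
    # fires; model_probs has shape (n_tokens, n_labels).  This term is added
    # to the otherwise label-free training objective.
    expected = model_probs.mean(axis=0)
    return float(np.sum(target * np.log(target / expected)))

# The model currently tags such tokens NUMBER only 60% of the time:
unlabeled_predictions = np.tile([0.6, 0.4], (100, 1))
print(ge_penalty(unlabeled_predictions))   # > 0, so training is pushed toward the target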
Joint Inference for NLP, Dialog
Pragmatics, Perception and Action
As part of a MURI project joint with UPenn, we have begun
work on probabilistic modeling of natural language dialog, designing
methods that will do joint, unified inference all the way from natural
language understanding, through dialog pragmatics, to perception and
action in a shared world. This work will leverage our research in
large-scale joint inference in CRFs.
Intelligent Understanding of our Email
As part of the CALO
project, we extracted information about people and other entities
appearing in email streams, performed large-scale entity resolution,
and discovered topics and expertise.
Conditional Probability Models for
Sequences and other Relational Data
Back in the 1990's, after having some success using hidden Markov
models for information extraction, we found ourselves frustrated by
their lack of ability to incorporate many arbitrary, overlapping
features of the input sequence, such as capitalization, lexicon
memberships, spelling features, and conjunctions of such features in a
large window of past and future observations. The same difficulties
with non-independent features exist in many generatively-trained models
historically used in NLP. We have begun work with conditionally-trained
probability models that address these problems. Maximum entropy Markov models are
locally-normalized conditional sequence models. Finite-state Conditional Random Fields (CRFs) are
globally-normalized models. We have also been working with CRFs for coreference and multi-sequence labeling, analogous
to conditionally-trained Dynamic Bayesian Networks (DBNs). We now work
with even more complex CRFs that integrate logic and probability, as described above.
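For readers who want the contrast in symbols, these are the standard textbook forms (not anything specific to our papers): a MEMM normalizes locally at each position, while a linear-chain CRF normalizes once over entire label sequences.

% MEMM: one locally-normalized conditional distribution per position
P(\mathbf{y} \mid \mathbf{x}) \;=\; \prod_{t=1}^{T} P(y_t \mid y_{t-1}, \mathbf{x}, t)

% Linear-chain CRF: a single global normalizer over whole label sequences
P(\mathbf{y} \mid \mathbf{x}) \;=\;
  \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)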
From 2000 through 2002 I was Vice President of Research and
Development at WhizBang Labs,
a start-up company focusing on information extraction from the Web. We
developed sophisticated machine learning extraction systems for
numerous application domains---among them FlipDog.com, a database of job
openings extracted directly from company Web sites (now owned by Monster.com), corporate information
for Dun & Bradstreet and Lexis Nexis, and course syllabi
for the U.S. Department of Labor.
In 1996 and 1997 I was part of Tom Mitchell's WebKB
project and the CMU
Text Learning group.
In what now seems like a lifetime ago, I was interested in
reinforcement learning---especially with hidden state and factored
representations. My thesis
uses memory-based learning and a robust statistical test on reward in
order to learn a structured policy representation that makes perceptual
and memory distinctions only where needed for the task at hand. It can
also be understood as a method of Value Function Approximation.
The model learned is an order-n partially observable Markov
decision process. It handles noisy observation, action and reward.
It is related to Ron, Singer and Tishby's Probabilistic
Suffix Trees, Leslie
Kaelbling's G-algorithm and Andrew Moore's Parti-game.
It is distinguished from similar-era work by Michael Littman, Craig Boutilier
and others in that it learns both a model and a policy, and is quite
practical with infinite-horizon tasks and large state and observation
spaces. Follow-on or comparison work has been done by Anders Jonsson, Andy Barto, Will Uther, Natalia Hernandez, Leslie Kaelbling, and Sridhar Mahadevan.
The algorithm, called U-Tree, was demonstrated
solving a highway driving task using simulated eye-movements and
deictic representations. The simulated environment has about 21000
states, 2500 observations, noise and much hidden state. After about 2
1/2 hours of simulated experience, U-Tree learns a
task-specific model of the environment that has only 143 states.
Its learned behavior included lane changes to avoid slow vehicles in
front, and checking the rear-view mirror to avoid faster vehicles
approaching from behind.
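A toy version of the splitting test at the heart of U-Tree (the thesis used a Kolmogorov-Smirnov test on future discounted reward; this sketch uses a simpler Welch-style statistic, and all data below is synthetic): introduce a new perceptual or memory distinction only if it measurably changes the reward the agent goes on to receive.

import random, statistics

random.seed(0)

def should_split(rewards_on, rewards_off, threshold=3.0):
    # Split a tree leaf on a candidate feature only if future reward differs
    # reliably between the two groups of experiences.
    m1, m2 = statistics.mean(rewards_on), statistics.mean(rewards_off)
    v1, v2 = statistics.variance(rewards_on), statistics.variance(rewards_off)
    n1, n2 = len(rewards_on), len(rewards_off)
    t = abs(m1 - m2) / ((v1 / n1 + v2 / n2) ** 0.5)
    return t > threshold

# "Slow car directly ahead" changes future reward; "billboard color" does not.
ahead_on       = [random.gauss(-1.0, 0.5) for _ in range(500)]
ahead_off      = [random.gauss( 0.0, 0.5) for _ in range(500)]
billboard_red  = [random.gauss( 0.0, 0.5) for _ in range(500)]
billboard_blue = [random.gauss( 0.0, 0.5) for _ in range(500)]

print(should_split(ahead_on, ahead_off))           # expected: True  -> make the distinction
print(should_split(billboard_red, billboard_blue)) # expected: False -> ignore the feature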