PhD Student
Office 264
Dept of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, MA (USA) 01003


Phone: (562) 726 4373
sameer AT cs.umass.edu

Latest News

I have moved to the University of Washington; please visit my new site.

About

I'm a PhD student in Computer Science at UMass Amherst, working with Andrew McCallum in the Information Extraction and Synthesis Lab (IESL) on learning and approximate inference techniques for large graphical models. I'm the chief maintainer of Factorie and have worked on some of the interesting machine learning problems within Rexa.

I interned at Microsoft Research, Cambridge (UK) last summer, working on Bayesian models of relational databases with Thore Graepel. In 2010 I interned at Google Research in Mountain View, CA, where I worked with Amar Subramanya and Fernando Pereira on inference for large graphical models, with cross-document coreference as the task. In summer 2009, I worked with the Advertising Sciences team at Yahoo! Labs on extracting entities from ads using minimal supervision. Before starting my PhD, I interned for two semesters at Google Pittsburgh, where I got to apply machine learning to some of the largest datasets available.

I co-chaired the fourth North-East Students Colloquium on Artificial Intelligence (NESCAI) 2010 with David Mimno. In December 2011, I was one of the main organizers of the popular Big Learning workshop on large-scale machine learning. I organized the ICML 2012 workshop on Inferning: Interactions between Inference and Learning, and am currently organizing the NIPS Big Learning 2012 workshop.

I received the Yahoo! Key Scientific Challenges Award for 2010-2011. For 2009-2010, I was granted the Department Award for Accomplishments in Search and Mining (sponsored by Yahoo!) by the Computer Science Department. I was also awarded the university's Graduate School Fellowship for 2010-2011.

Before starting my PhD, I completed my MS in Computer Science at Vanderbilt in May 2007, where I worked with Doug Fisher and Julie Adams. I grew up in New Delhi, India, received my Bachelor's in Electrical Engineering from NSIT, and attended high school at Sardar Patel Vidyalaya.

Research Interests

  • Machine Learning
  • Information Extraction and NLP
  • Large datasets and Scalability
  • Semi-Supervised Learning
  • Approximate Inference in Graphical Models
  • Reinforcement Learning
[Image: word cloud from my publications]

Recent Publications

  • J. Zheng, L. Vilnis, S. Singh, J. Choi, A. McCallum
    Dynamic Knowledge-Base Alignment for Coreference Resolution
    Conference on Computational Natural Language Learning (CoNLL), 2013
PDF

Coreference resolution systems can benefit greatly from the inclusion of global context, and a number of recent approaches have demonstrated improvements when precomputing an alignment to external knowledge sources. However, since alignment itself is a challenging task and is often noisy, existing systems either align conservatively, resulting in very few links, or combine the attributes of multiple candidates, leading to a conflation of entities. Our approach instead performs joint inference between within-document coreference and entity linking, maintaining ranked lists of candidate entities that are dynamically merged and reranked during inference. Further, we incorporate a large set of surface string variations for each entity by using anchor texts from the web that link to the entity. These forms of global context enable our system to improve classifier-based coreference by 1.09 B3 F1 points and to improve over the previous state of the art by 0.41 points, establishing a new state-of-the-art result on the ACE 2004 data.

@inproceedings{zheng13:dynamic,
	Author = {Jiaping Zheng and Luke Vilnis and Sameer Singh and Jinho Choi and Andrew McCallum},
	Booktitle = {Conference on Computational Natural Language Learning (CoNLL)},
	Title = {Dynamic Knowledge-Base Alignment for Coreference Resolution},
	Year = {2013}}
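
To give a flavor of the dynamic alignment idea, here is a minimal Python sketch, not the actual system: each coreference cluster keeps a scored list of candidate KB entities, and candidate lists are merged and reranked whenever clusters merge during inference. The Cluster class and the additive score pooling are invented for illustration.

# Hypothetical sketch of dynamic candidate-list merging; not the paper's code.

class Cluster:
    def __init__(self, mention, candidates):
        self.mentions = [mention]
        # candidates: KB entity id -> alignment score for this mention
        self.candidates = dict(candidates)

    def merge(self, other):
        """Absorb another cluster, combining and reranking candidates."""
        self.mentions.extend(other.mentions)
        for entity, score in other.candidates.items():
            # Summing scores is one simple way to pool alignment evidence.
            self.candidates[entity] = self.candidates.get(entity, 0.0) + score

    def top_entity(self):
        return max(self.candidates, key=self.candidates.get, default=None)

# Two mentions that individually link ambiguously, but whose merged
# cluster agrees on a single entity:
c1 = Cluster("Armstrong", {"Neil_Armstrong": 0.4, "Lance_Armstrong": 0.35})
c2 = Cluster("the astronaut Armstrong", {"Neil_Armstrong": 0.6})
c1.merge(c2)
print(c1.top_entity())  # Neil_Armstrong
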
  • S. Singh, A. Subramanya, F. Pereira, A. McCallum
    Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia
    University of Massachusetts Amherst, CMPSCI Technical Report, UM-CS-2012-015, 2012
PDF

Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. This paper presents a method for automatically gathering massive amounts of naturally-occurring cross-document reference data. We also present the Wikilinks dataset comprising 40 million mentions of over 3 million entities, gathered using this method. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.

@techreport{singh12:wikilinks,
	Address = {University of Massachusetts},
	Author = {Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum},
	Number = {UM-CS-2012-015},
	Title = {Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia},
	Year = {2012}}
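
The gathering method lends itself to a short sketch; the following Python is illustrative only, not the actual Wikilinks pipeline. It scans crawled HTML for hyperlinks into Wikipedia and treats each anchor text as a labeled mention of the linked entity.

# Hypothetical sketch of anchor-text mention harvesting; regex-based for brevity.
import re

WIKI_LINK = re.compile(
    r'<a\s[^>]*href="https?://en\.wikipedia\.org/wiki/([^"#]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL)

def wikipedia_mentions(html):
    """Yield (anchor_text, wikipedia_title) pairs found in a page."""
    for title, anchor in WIKI_LINK.findall(html):
        anchor = re.sub(r"<[^>]+>", "", anchor).strip()  # drop nested tags
        if anchor:
            yield anchor, title

page = '<p><a href="http://en.wikipedia.org/wiki/Neil_Armstrong">the first man on the moon</a></p>'
print(list(wikipedia_mentions(page)))  # [('the first man on the moon', 'Neil_Armstrong')]
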
  • S. Singh, M. Wick, A. McCallum
    Monte Carlo MCMC: Efficient Inference by Approximate Sampling
    Conference on Empirical Methods in Natural Language Processing and Natural Language Learning (EMNLP-CoNLL), 2012
PDF, Slides
Conditional random fields and other graphical models have achieved state-of-the-art results in a variety of tasks such as coreference, relation extraction, data integration, and parsing. Increasingly, practitioners are using models with more complex structure (higher tree-width, larger fan-out, more features, and more data), rendering even approximate inference methods such as MCMC inefficient. In this paper we propose an alternative MCMC sampling scheme in which transition probabilities are approximated by sampling from the set of relevant factors. We demonstrate that our method converges more quickly than a traditional MCMC sampler for both marginal and MAP inference. On an author coreference task with over 5 million mentions, we achieve a 13x speedup over regular MCMC inference.
@inproceedings{singh12:mcmcmc,
	Author = {Sameer Singh and Michael Wick and Andrew McCallum},
	Booktitle = {Conference on Empirical Methods in Natural Language Processing and Natural Language Learning (EMNLP-CoNLL)},
	Title = {Monte Carlo MCMC: Efficient Inference by Approximate Sampling},
	Year = {2012}}
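
A minimal sketch of the sampling scheme, assuming a generic factor-graph interface: the acceptance decision for a symmetric proposal is based on a uniformly sampled subset of the factors it touches, with the score difference scaled back up. The functions propose, factors_touching, and score are hypothetical stand-ins, not Factorie's API.

# Hypothetical sketch of MCMC with sampled factors; not the paper's implementation.
import math, random

def approx_mh_step(state, propose, factors_touching, score, sample_frac=0.1):
    proposal = propose(state)                      # assumes a symmetric proposal
    factors = factors_touching(state, proposal)    # factors whose value changes
    k = max(1, int(sample_frac * len(factors)))
    sampled = random.sample(factors, k)
    # Unbiased estimate of the full log-score difference from the subset.
    delta = (len(factors) / k) * sum(
        score(f, proposal) - score(f, state) for f in sampled)
    return proposal if math.log(random.random()) < delta else state
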
  • S. Singh, T. Graepel
    Compiling Relational Database Schemata into Probabilistic Graphical Models
    Neural Information Processing Systems (NIPS), Workshop on Probabilistic Programming, 2012
PDF, arXiv, Project Page

Instead of requiring a domain expert to specify the probabilistic dependencies of the data, we present an approach that uses the relational DB schema to automatically construct a Bayesian graphical model for a database. The resulting model contains customized distributions for columns, latent variables that cluster the data, and factors that reflect and represent the foreign key links. Experiments demonstrate the accuracy of the model and the scalability of inference on synthetic and real-world data.

@inproceedings{singh12:compiling,
	Author = {Sameer Singh and Thore Graepel},
	Booktitle = {Neural Information Processing Systems (NIPS), Workshop on Probabilistic Programming},
	Title = {Compiling Relational Database Schemata into Probabilistic Graphical Models},
	Year = {2012}}
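
The construction can be caricatured in a few lines of Python; the names here are invented and the paper's model is richer. Each table contributes a latent cluster variable per row, each column an emission distribution conditioned on that cluster, and each foreign key a factor coupling the latent variables of the linked rows.

# Hypothetical sketch of compiling a schema into variables and factors.
def compile_schema(tables, foreign_keys):
    """tables: {name: [column, ...]}; foreign_keys: [(table, referenced_table)]."""
    variables, factors = [], []
    for table, columns in tables.items():
        variables.append(("cluster", table))    # latent mixture id, one per row
        factors.append(("prior", table))        # prior over the latent cluster
        for col in columns:
            # observed column modeled given the row's latent cluster
            factors.append(("emit", table, col))
    for table, ref in foreign_keys:
        # a foreign key couples the latent clusters of the two linked rows
        factors.append(("link", table, ref))
    return variables, factors

vs, fs = compile_schema({"Customer": ["city", "age"], "Order": ["amount"]},
                        [("Order", "Customer")])
print(vs)  # [('cluster', 'Customer'), ('cluster', 'Order')]
print(fs)
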
  • S. Singh, G. Druck, A. McCallum
    Constraint-Driven Training of Complex Models Using MCMC
    University of Massachusetts Amherst, CMPSCI Technical Report, UM-CS-2012-032, 2012
PDF

Standard machine learning approaches require labeled data, and labeling data for each task, language, and domain of interest is not feasible. Consequently, there has been much interest in developing training algorithms that can leverage constraints from prior knowledge to augment or replace labeled data. Most previous work in this area assumes that there exist efficient inference algorithms for the model being trained. For many NLP tasks of interest, such as entity resolution, complex models that require approximate inference are advantageous. In this paper we study algorithms for training complex models using constraints from prior knowledge. We propose an MCMC-based approximation to Generalized Expectation (GE) training, and compare it to Constraint-Driven SampleRank (CDSR). Sequence labeling experiments demonstrate that MCMC GE closely approximates exact GE, and that GE can substantially outperform CDSR. We then apply these methods to train densely-connected citation resolution models. Both methods yield highly accurate models (up to 94% mean pairwise F1) with only two simple constraints.

@techreport{singh12:mcmc-ge,
	Address = {University of Massachusetts},
	Author = {Sameer Singh and Gregory Druck and Andrew McCallum},
	Number = {UM-CS-2012-032},
	Title = {Constraint-Driven Training of Complex Models Using MCMC},
	Year = {2012}}
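
As a toy illustration of the MCMC approximation to GE, with all interfaces assumed rather than taken from the paper: the model expectation of a constraint function, which exact GE computes by inference, is replaced by an average over MCMC samples.

# Hypothetical sketch of a sampled GE penalty; not the paper's implementation.
import random

def ge_penalty(sample_stream, constraint, target, n=1000):
    """Squared distance between the target and an MCMC estimate of E[constraint]."""
    estimate = sum(constraint(next(sample_stream)) for _ in range(n)) / n
    return (estimate - target) ** 2

# Toy usage: samples are random binary labelings, and the constraint says
# that 70% of tokens should receive label 1.
def toy_sampler(length=10):
    while True:
        yield [random.randint(0, 1) for _ in range(length)]

print(ge_penalty(toy_sampler(), lambda y: sum(y) / len(y), target=0.7))
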
  • M. Wick, S. Singh, A. McCallum
    A Discriminative Hierarchical Model for Fast Coreference at Large Scale
    Association for Computational Linguistics (ACL), 2012
PDF
Coming Soon!
@inproceedings{wick12:a-discriminative,
	Author = {Michael Wick and Sameer Singh and Andrew McCallum},
	Booktitle = {Association for Computational Linguistics (ACL)},
	Title = {A Discriminative Hierarchical Model for Fast Coreference at Large Scale},
	Year = {2012}}

.. list of publications ..