NSF ITR: Unified Graphical Models of Information Extraction and Data Mining with Application to Social Network Analysis

Computer Science Department
140 Governors Drive, University of Massachusetts, Amherst, MA 01003-9264

NSF IIS 0326249

Summary

This project aims to improve our ability to data mine information previously locked in unstructured natural language text. It focuses on developing novel statistical models for information extraction and data mining that have such tight integration that the boundaries between them disappear.
Current information extraction methods populate slots in a database by identifying relevant subsequences of text, but they are usually unaware of the emerging patterns and regularities in the database. Current data mining methods begin from a populated database, and they are often unaware of where the data came from, or its inherent uncertainties. Consequently, the accuracy of both suffers, and significant mining of complex text sources is beyond reach.
This project uses probabilistic graphical models that make extraction and mining decisions with a common inference procedure. Such models promise significant gains in accuracy and capability, as well as an opportunity for deeper understanding of the role of top-down and bottom-up processing in language and understanding.
The project grounds this work in a real-world application domain by constructing a Web portal about scientific research---its publications, people, venues, institutions, and funding---enabling insights into the flow of scientific ideas.

Personnel

Principal Investigators: Andrew McCallum, PI; David Jensen, co-PI
Postdoctoral fellow: Fuchun Peng
Graduate students: Charles Sutton, Aron Culotta, Kedar Bellare, Pallika Kanani, Michael Wick, David Mimno, Michael Hay, Lisa Friedland
Technical staff: Adam Saunders, Cynthia Loiselle, Agustin Schapira

Publications

Peng, F. and McCallum, A., "Accurate Information Extraction from Research Papers using Conditional Random Fields", Proceedings of HLT-NAACL 04, Boston, Massachusetts, (2004), p. 329.
Kristjansson, T., Culotta, A., Viola, P., and McCallum, A., "Interactive Information Extraction with Constrained Conditional Random Fields", Sixteenth Innovative Applications of AI Conference (AAAI 2004), San Jose, CA, (2004), p. 412.
Culotta, Aron and McCallum, Andrew, "Confidence Estimation for Information Extraction", Poster presentation in Proceedings of HLT-NAACL 2004, (2004), p. 109.
Sutton, C., Rohanimanesh, K. and McCallum, A., "Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data", Proceedings of The Twenty-First International Conference on Machine Learning, (2004), p. 783.
Ben Wellner, Andrew McCallum, Fuchun Peng and Michael Hay, "An Integrated, Conditional Model of Information Extraction and Coreference with Application to Citation Matching", Proceedings of the Conference on Uncertainty in Artificial Intelligence 2004, (2004), p. 593.
Aron Culotta, Ron Bekkerman, and Andrew McCallum, "Extracting social networks and contact information from email and the Web", Electronic Proceedings of the Conference on Email and Spam (CEAS) 2004 (www.ceas.cc), (2004), p. 1.
Sutton, C., and McCallum, A., "Collective Segmentation and Labeling of Distant Entities in Information Extraction", CmpSci Technical Report TR # 04-49, University of Massachusetts, July 2004. Presented at ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields. Banff, Canada, (2004), p. 1.
Welling, M. and Sutton, C., "Learning in Markov Random Fields with Contrastive Free Energies", Online Proceedings of AISTATS 2005, (2005), p. 1.
Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang, "The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks", CIIR Technical Report, (2004), p. 1.
Andrew McCallum and Charles Sutton, "Piecewise Training with Parameter Independence Diagrams: Comparing Globally- and Locally-trained Linear-chain CRFs", in CIIR Technical Report; presented at NIPS 2004 Workshop on Learning with Structured Outputs, (2004), p. 1.
Weinman, J., Hanson, A., and McCallum, A., "Sign Detection in Natural Images with Conditional Random Fields", in the Proceedings of IEEE International Workshop on Machine Learning for Signal Processing, (2004), p. 1.
Charles Sutton, Michael Sindelar, and Andrew McCallum, "Feature Bagging: Preventing Weight Undertraining in Structured Discriminative Learning", CIIR Technical Report, (2005), p. 1.
Charles Sutton and Andrew McCallum, "Fast, Piecewise Training for Discriminative Finite-state and Parsing Models", CIIR Technical Report, (2005), p. 1.
Ghamrawi, N. and McCallum, A., "Collective Multi-label Classification", Proceedings of CIKM 2005, (2005), p. 195.
Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang, "Topic and Role Discovery in Social Networks", Proceedings of Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, (2005), p. 786.
Aron Culotta and Andrew McCallum, "Joint Deduplication of Multiple Record Types in Relational Data", Poster Presentation in the Proceedings of CIKM 2005, (2005), p. 257.
Aron Culotta and Andrew McCallum, "Reducing labeling effort for structured prediction tasks", Proceedings of the American Association of Artificial Intelligence (AAAI05); poster presentation, (2005), p. 746.
Charles Sutton and Andrew McCallum, "Piecewise Training for Undirected Models", Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, (2005), p. 568.
Li, Wei, "Semi-Supervised Sequence Modeling with Syntactic Topic Models", Proceedings of the 12th Conference on AI, (2005), p. 813.
Aron Culotta and David Kulp and Andrew McCallum, "Gene prediction with conditional random fields", CIIR Technical Report, (2005), p. 1.
Charles Sutton and Andrew McCallum, "Joint Parsing and Semantic Role Labeling", Proceedings of the Ninth Conference on Natural Language Learning, (2005), p. 225.
Neville, J., t, vol. , (2005), p. 1.
McCallum, A., Wang, X. and Pal, C., "Predictive Random Fields: Multiway Conditional Probability Models for Clustering", UMass CmpSci Technical Report UM-CS-2005-053, (2005), p. 1.
McCallum, A., Bellare, K. and Pereira, F., "A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance", Proceedings of the 21st Conference on Uncertainty in AI (UAI-2005), (2005), p. 388.
Wang, X., Mohanty, N. and McCallum, A., "Group and Topic Discovery from Relations and Text", Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD), (2005), p. 28.
Sutton, C., Pal, C. and McCallum, A., "Sparse Forward-Backward for Fast Training of Conditional Random Fields", Poster presentation, in NIPS 2005 Workshop on Structured Learning for Text and Speech Processing, (2005), p. 1.
Sutton, C., Pal, C. and McCallum, A., "Sparse forward-backward using minimum divergence beams for fast training of conditional random fields", Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), (2006), p. 581.
Sutton, C., Sindelar, M. and McCallum, A., "Reducing Weight Undertraining in Structured Discriminative Learning", Proceedings of HLT/NAACL 2006, (2006), p. 89.
Mann, G., Mimno, D. and McCallum, A., "Bibliometric Impact Measures Leveraging Topic Analysis", Proceedings of Joint Conference on Digital Libraries, (2006), p. 65.
Aron Culotta and Andrew McCallum and Jonathan Betz, "Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text", poster presentation in HLT 2006, (2006), p. 296.
Wick, Michael, Culotta, Aron and McCallum, Andrew, "Learning Field Compatibilities to Extract Database Records from Unstructured Text", Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), (2006), p. 603.
Aron Culotta and Andrew McCallum, "Practical Markov logic containing first-order quantifiers with application to identity uncertainty", Proceedings of HLT Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, (2006), p. 41.
Charles Sutton and Andrew McCallum and Khashayar Rohanimanesh, "Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data", Journal of Machine Learning Research, (2006).
Culotta, A. and McCallum, A., "A Conditional Model of Deduplication for Multi-type Relational Data", CIIR Technical Report, (2005), p. 1.
Sutton, C. and McCallum, A., "An Introduction to Conditional Random Fields for Relational Learning", in Lise Getoor and Ben Taskar, "Introduction to Statistical Relational Learning", MIT Press, (2006).
Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron Bekkerman and Andrew McCallum. Conference on Email and Spam (CEAS) 2004.
An Integrated, Conditional Model of Information Extraction and Coreference with Application to Citation Matching. Ben Wellner, Andrew McCallum, Fuchun Peng, Michael Hay. Conference on Uncertainty in Artificial Intelligence (UAI), 2004.
Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Charles Sutton, Khashayar Rohanimanesh and Andrew McCallum. ICML 2004.
Interactive Information Extraction with Constrained Conditional Random Fields. Trausti Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum. Nineteenth National Conference on Artificial Intelligence (AAAI 2004). San Jose, CA. (Winner of Honorable Mention Award.)
Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun Peng and Andrew McCallum. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004.
Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, short paper.
A Note on the Unification of Information Extraction and Data Mining using Conditional-Probability, Relational Models. Andrew McCallum and David Jensen. IJCAI'03 Workshop on Learning Statistical Models from Relational Data, 2003.
Dynamic Conditional Random Fields for Jointly Labeling Multiple Sequences. Andrew McCallum, Khashayar Rohanimanesh and Charles Sutton. NIPS*2003 Workshop on Syntax, Semantics, Statistics, 2003.

Acknowledgments

This project is supported in part by the Central Intelligence Agency, the National Security Agency, and the National Science Foundation under NSF grant #IIS-0326249. The work is being performed within the University of Massachusetts Information Extraction and Synthesis Laboratory (IESL), the Center for Intelligent Information Retrieval (CIIR), and the Knowledge Discovery Laboratory (KDL).