UMass Machine Learning and Friends Lunch: Inferring and Exploiting Relational Structure in Large Text Collections

Abstract: The digitization of knowledge and concerted retrospective scanning projects are making overwhelming amounts of text in diverse domains, genres, and languages available to readers and researchers. To make this data useful, our group is working on improving OCR, language modeling, syntactic analysis, information extraction, and information retrieval. I will focus in particular on problems of inferring the relational structure latent in large collections of documents, such as books, web pages, patent applications, grant proposals, and social media postings. Which books or passages quote, translate, paraphrase, and cite each other? This research requires improvements in modeling translation and other forms of similarity, as well as improvements in efficiently comparing large numbers of passages. Finally, I will discuss how passage similarity relations can be used to improve tasks such as named-entity recognition and syntactic parsing.

Bio: David Smith is a Research Assistant Professor in the Computer Science Department at the University of Massachusetts, Amherst, where he conducts research on natural language processing, computational linguistics, information retrieval, digital libraries, and machine translation.