What Your Knowledge Base Is Missing And Where To Find It
Knowledge bases, large collections of facts (RDF triples) on diverse topics, are widely used in applications, including enhancing search results for multiple major search engines. However, existing knowledge bases are incomplete: many facts are missing, in particular little-known, long-tail facts such as Snow White’s age and the PC chairs of ICML 2015. Augmenting knowledge bases is crucial for the correctness of the applications that use them and for the quality of the user experience, but completing them with new facts is not easy. Even though current information extraction systems extract facts from web sources automatically, determining what to extract and from where, and evaluating the quality of the extracted facts, still require manual effort and depend heavily on domain experts. In this work, our goal is to reduce this manual effort by automatically identifying high-quality sources for missing facts. We develop a technique that recommends data-source and topic pairs using facts (triples) generated automatically by multiple information extraction systems. We define a profit function to quantify the quality of a candidate data-source and topic pair, propose a highly scalable summarization pipeline to derive high-profit recommendations, and report preliminary evaluation results.
This work is ongoing and we welcome suggestions and feedback.
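To make the idea of scoring data-source and topic pairs concrete, here is a minimal sketch in Python. It is not the profit function from the talk, which the abstract does not specify. It assumes a simple stand-in: each extracted triple that is absent from the knowledge base contributes a fixed gain, and every extracted triple incurs a fixed processing cost. The function name, the `gain` and `cost` parameters, the source names, and the toy triples are all hypothetical.

```python
from collections import defaultdict


def rank_source_topic_pairs(extracted, kb, gain=1.0, cost=0.2):
    """Rank (source, topic) pairs by an assumed, illustrative profit.

    extracted: iterable of (source, topic, triple) records
    kb:        set of triples already present in the knowledge base
    profit  =  gain * (#distinct triples missing from kb)
             - cost * (#distinct triples extracted for the pair)
    """
    all_facts = defaultdict(set)
    new_facts = defaultdict(set)
    for source, topic, triple in extracted:
        key = (source, topic)
        all_facts[key].add(triple)
        if triple not in kb:
            new_facts[key].add(triple)

    profit = {
        key: gain * len(new_facts[key]) - cost * len(all_facts[key])
        for key in all_facts
    }
    # Highest-profit pairs first
    return sorted(profit.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical sources and placeholder values)
kb = {("Snow White", "type", "character")}
extracted = [
    ("a.example", "age", ("Snow White", "age", "value-1")),
    ("a.example", "age", ("Cinderella", "age", "value-2")),
    ("a.example", "type", ("Snow White", "type", "character")),
    ("b.example", "type", ("Snow White", "type", "character")),
]

ranked = rank_source_topic_pairs(extracted, kb)
for (source, topic), score in ranked:
    print(source, topic, round(score, 2))
```

In this toy setting, the pair that contributes only missing facts ranks first, while pairs that merely repeat what the knowledge base already contains get a negative score. A realistic profit function would also need to account for extraction correctness, which this sketch ignores.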
Xiaolan Wang is a third-year PhD student in the College of Information and Computer Sciences, University of Massachusetts, Amherst, advised by Prof. Alexandra Meliou. Her research interests include database management, data cleaning, and data integration. She enjoys working on projects with strong theoretical grounding and practical impact. Xiaolan was awarded a Google PhD Fellowship in 2015.