Data X-Ray: Diagnosing errors in data systems


Abstract:
Poor data quality is estimated to cost the US economy more than $600 billion per year, and erroneous price data in retail databases alone cost US consumers $2.5 billion each year. Although existing data cleaning techniques can be quite effective at purging datasets of errors, they disregard the fact that many errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, this project focuses on data diagnosis: explaining where and how errors arise in a data-generating process. [PDF]   [SLIDES]   [DEMO]   [CODE]
Collaborators:

Xin Luna Dong (Amazon)   Yue Wang (UMass Amherst, demo)   Mary Feng (University of Iowa, demo)

QFix: Diagnosing errors through query histories


Abstract:
Data-driven applications rely on the correctness of their data to function properly and effectively. Errors in data can be incredibly costly and disruptive, leading to loss of revenue, incorrect conclusions, and misguided policy decisions. While data cleaning tools can purge datasets of many errors before the data is used, applications and users interacting with the data can introduce new errors. Subsequent valid updates can obscure these errors and propagate them through the dataset, causing more discrepancies. Even when some of these discrepancies are discovered, they are often corrected superficially, on a case-by-case basis, further obscuring the true underlying cause and making detection of the remaining errors harder. In this project, we propose QFix, a framework that derives explanations and repairs for discrepancies in relational data by analyzing the effect of queries that operated on the data and identifying potential mistakes in those queries. [PDF]   [DEMO]
Collaborators:

Eugene Wu (Columbia University)

MIDAS: Using the Wealth of Web Sources to Fill Knowledge Gaps


Abstract:
Knowledge bases, massive collections of facts (RDF triples) on diverse topics, support vital modern applications, such as enhancing search results for several major search engines. However, existing knowledge bases are incomplete, with many facts missing, in particular little-known, long-tail facts. Augmenting knowledge bases is crucial for the correctness and effectiveness of the applications. Our goal is to identify web sources that contain rich new information for knowledge base augmentation and to generate descriptions of their content. We model these content descriptions with the novel concept of web source slices. In this project, we aim to reduce the manual effort of finding such sources by automatically identifying high-quality web source slices for augmenting knowledge bases.
[Tech Report]  
Collaborators:

Xin Luna Dong (Amazon)   Yang Li (Google)