by Xiaolan Wang, Xin Luna Dong, Alexandra Meliou

Abstract:

Many systems and applications are data-driven, and the correctness of their operation relies heavily on the correctness of their data. While existing data cleaning techniques can be quite effective at purging datasets of errors, they disregard the fact that many errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, in this paper we focus on data diagnosis: explaining where and how the errors happen in a data generative process. We develop a large-scale diagnostic framework called DataXRay. Our contributions are three-fold. First, we transform the diagnosis problem into the problem of finding common properties among erroneous elements, with minimal domain-specific assumptions. Second, we use Bayesian analysis to derive a cost model that implements three intuitive principles of good diagnoses. Third, we design an efficient, highly parallelizable algorithm for performing data diagnosis on large-scale data. We evaluate our cost model and algorithm using both real-world and synthetic data, and show that our diagnostic framework produces better diagnoses and is orders of magnitude more efficient than existing techniques.
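To make the first contribution concrete, the sketch below illustrates the general idea of diagnosing systematic errors by finding properties shared among erroneous elements. This is a hypothetical, simplified illustration, not the paper's Data X-Ray algorithm or cost model: each data element is tagged with a set of property strings (e.g., which extractor or source produced it), and properties whose elements are predominantly erroneous are flagged as likely error causes.

```python
# Illustrative sketch only (NOT the Data X-Ray algorithm): flag properties
# whose associated elements are mostly erroneous, suggesting a systematic
# problem in the process that produced those elements.
from collections import defaultdict

def diagnose(elements, min_error_rate=0.8):
    """elements: list of (properties, is_error) pairs, where properties is
    a set of feature strings (e.g., "extractor:A", "attr:phone").
    Returns a dict mapping each suspicious property to its error rate."""
    errors = defaultdict(int)   # property -> count of erroneous elements
    totals = defaultdict(int)   # property -> count of all elements
    for props, is_error in elements:
        for p in props:
            totals[p] += 1
            if is_error:
                errors[p] += 1
    return {p: errors[p] / totals[p]
            for p in totals
            if errors[p] / totals[p] >= min_error_rate}

# Hypothetical example: everything produced by extractor A is wrong,
# so "extractor:A" is flagged while the individual attributes are not.
elements = [
    ({"extractor:A", "attr:phone"}, True),
    ({"extractor:A", "attr:email"}, True),
    ({"extractor:B", "attr:phone"}, False),
    ({"extractor:B", "attr:email"}, False),
]
print(diagnose(elements))  # → {'extractor:A': 1.0}
```

A real diagnosis must also trade off conciseness, specificity, and consistency among candidate explanations (the paper's three principles), which this naive per-property error rate does not capture.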

Citation:

Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou, Data X-Ray: A Diagnostic Tool for Data Errors, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2015, pp. 1231–1245.

Bibtex:

@inproceedings{WangDM2015,
  author    = {Xiaolan Wang and Xin Luna Dong and Alexandra Meliou},
  title     = {\href{http://people.cs.umass.edu/ameli/projects/dataxray/papers/modf554-Wang.pdf}{Data {X-Ray}: A Diagnostic Tool for Data Errors}},
  abstract  = {A lot of systems and applications are data-driven, and the correctness of their operation relies heavily on the correctness of their data. While existing data cleaning techniques can be quite effective at purging datasets of errors, they disregard the fact that a lot of errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, in this paper we focus on data diagnosis: explaining where and how the errors happen in a data generative process. We develop a large-scale diagnostic framework called DataXRay. Our contributions are three-fold. First, we transform the diagnosis problem to the problem of finding common properties among erroneous elements, with minimal domain-specific assumptions. Second, we use Bayesian analysis to derive a cost model that implements three intuitive principles of good diagnoses. Third, we design an efficient, highly-parallelizable algorithm for performing data diagnosis on large-scale data. We evaluate our cost model and algorithm using both real-world and synthetic data, and show that our diagnostic framework produces better diagnoses and is orders of magnitude more efficient than existing techniques.},
  booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD)},
  venue     = {SIGMOD},
  pages     = {1231--1245},
  year      = {2015},
  doi       = {10.1145/2723372.2750549},
}