by Xiaolan Wang, Xin Luna Dong, Alexandra Meliou
Abstract:
A lot of systems and applications are data-driven, and the correctness of their operation relies heavily on the correctness of their data. While existing data cleaning techniques can be quite effective at purging datasets of errors, they disregard the fact that a lot of errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, in this paper we focus on data diagnosis: explaining where and how the errors happen in a data generative process. We develop a large-scale diagnostic framework called DataXRay. Our contributions are three-fold. First, we transform the diagnosis problem to the problem of finding common properties among erroneous elements, with minimal domain-specific assumptions. Second, we use Bayesian analysis to derive a cost model that implements three intuitive principles of good diagnoses. Third, we design an efficient, highly-parallelizable algorithm for performing data diagnosis on large-scale data. We evaluate our cost model and algorithm using both real-world and synthetic data, and show that our diagnostic framework produces better diagnoses and is orders of magnitude more efficient than existing techniques.
Citation:
Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou, Data X-Ray: A Diagnostic Tool for Data Errors, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2015, pp. 1231–1245.
Bibtex:
@inproceedings{WangDM2015,
author = {Xiaolan Wang and
Xin Luna Dong and
Alexandra Meliou},
title = {\href{http://people.cs.umass.edu/ameli/projects/dataxray/papers/modf554-Wang.pdf}{Data {X-Ray}: A Diagnostic Tool for Data Errors}},
abstract = {A lot of systems and applications are data-driven, and the
correctness of their operation relies heavily on the correctness of their
data. While existing data cleaning techniques can be quite effective at
purging datasets of errors, they disregard the fact that a lot of errors are
systematic, inherent to the process that produces the data, and thus will
keep occurring unless the problem is corrected at its source. In contrast to
traditional data cleaning, in this paper we focus on data diagnosis:
explaining where and how the errors happen in a data generative process.
We develop a large-scale diagnostic framework called DataXRay. Our
contributions are three-fold. First, we transform the diagnosis problem to
the problem of finding common properties among erroneous elements, with
minimal domain-specific assumptions. Second, we use Bayesian analysis to
derive a cost model that implements three intuitive principles of good
diagnoses. Third, we design an efficient, highly-parallelizable algorithm
for performing data diagnosis on large-scale data. We evaluate our cost
model and algorithm using both real-world and synthetic data, and show that
our diagnostic framework produces better diagnoses and is orders of
magnitude more efficient than existing techniques.},
booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD)},
venue = {SIGMOD},
pages = {1231--1245},
year = {2015},
doi = {10.1145/2723372.2750549},
}