Data X-Ray: A Diagnostic Tool for Data Errors
by Xiaolan Wang, Xin Luna Dong, Alexandra Meliou
Abstract:
Many systems and applications are data-driven, and the correctness of their operation relies heavily on the correctness of their data. While existing data cleaning techniques can be quite effective at purging datasets of errors, they disregard the fact that many errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, in this paper we focus on data diagnosis: explaining where and how the errors happen in a data generative process. We develop a large-scale diagnostic framework called Data X-Ray. Our contributions are three-fold. First, we transform the diagnosis problem into the problem of finding common properties among erroneous elements, with minimal domain-specific assumptions. Second, we use Bayesian analysis to derive a cost model that implements three intuitive principles of good diagnoses. Third, we design an efficient, highly parallelizable algorithm for performing data diagnosis on large-scale data. We evaluate our cost model and algorithm using both real-world and synthetic data, and show that our diagnostic framework produces better diagnoses and is orders of magnitude more efficient than existing techniques.
Citation:
Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou, Data X-Ray: A Diagnostic Tool for Data Errors, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2015, pp. 1231–1245.
Bibtex:
@inproceedings{WangDM2015,
  author    = {Xiaolan Wang and
               Xin Luna Dong and
               Alexandra Meliou},
  title     = {\href{http://people.cs.umass.edu/ameli/projects/dataxray/papers/modf554-Wang.pdf}{Data {X-Ray}: A Diagnostic Tool for Data Errors}},
  abstract  = {Many systems and applications are data-driven, and the
  correctness of their operation relies heavily on the correctness of their
  data. While existing data cleaning techniques can be quite effective at
  purging datasets of errors, they disregard the fact that many errors are
  systematic, inherent to the process that produces the data, and thus will
  keep occurring unless the problem is corrected at its source. In contrast to
  traditional data cleaning, in this paper we focus on data diagnosis:
  explaining where and how the errors happen in a data generative process.

  We develop a large-scale diagnostic framework called Data X-Ray. Our
  contributions are three-fold. First, we transform the diagnosis problem into
  the problem of finding common properties among erroneous elements, with
  minimal domain-specific assumptions. Second, we use Bayesian analysis to
  derive a cost model that implements three intuitive principles of good
  diagnoses. Third, we design an efficient, highly parallelizable algorithm
  for performing data diagnosis on large-scale data. We evaluate our cost
  model and algorithm using both real-world and synthetic data, and show that
  our diagnostic framework produces better diagnoses and is orders of
  magnitude more efficient than existing techniques.},
  booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD)},
  venue     = {SIGMOD},
  pages     = {1231--1245},
  year      = {2015},
  doi       = {10.1145/2723372.2750549},
}