Preventing Data Errors with Continuous Testing
by Kıvanç Muşlu, Yuriy Brun, Alexandra Meliou
Abstract:
Today, software systems that use data are ubiquitous, and ensuring the data's quality is an increasingly important challenge as data errors result in annual multi-billion dollar losses. While software debugging and testing have received heavy research attention, less effort has been devoted to data debugging: identifying system errors caused by well-formed but incorrect data. We present continuous data testing (CDT), a low-overhead, delay-free technique that quickly identifies likely data errors. CDT continuously executes domain-specific test queries; when a test fails, CDT unobtrusively warns the user or administrator. We implement CDT in the ConTest prototype for the PostgreSQL database management system. A user study with 96 humans shows that ConTest is extremely effective at guarding against data entry errors: With ConTest, users corrected 98.4% of their errors, as opposed to 40.2% without, even when we injected 40% false positives into ConTest's output. Further, when using ConTest, users corrected data entry errors 3.2 times faster than when using state-of-the-art methods.
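Illustration (not from the paper):
The core idea in the abstract, rerunning test queries against the data and warning when one fails, can be sketched roughly as follows. This is a minimal illustration, not ConTest itself. It uses Python's built-in sqlite3 module in place of PostgreSQL so that it is self-contained. The employees table, the two test queries, and the function names are all hypothetical. It also reruns the tests synchronously after each update, which is a simplification and does not attempt the low-overhead, delay-free execution the abstract describes.

```python
import sqlite3

# Hypothetical domain-specific test queries (assumed for illustration).
# Each one selects rows that look wrong, so an empty result means "pass".
TEST_QUERIES = {
    "salary_is_positive": "SELECT id FROM employees WHERE salary <= 0",
    "salary_below_cap": "SELECT id FROM employees WHERE salary > 1000000",
}


def run_tests(conn):
    """Run every test query; return {test name: offending ids} for failures."""
    failures = {}
    for name, query in TEST_QUERIES.items():
        rows = conn.execute(query).fetchall()
        if rows:
            failures[name] = [row[0] for row in rows]
    return failures


def update_salary(conn, employee_id, salary):
    """Apply an update, rerun the tests, and return warnings (never blocks)."""
    conn.execute(
        "UPDATE employees SET salary = ? WHERE id = ?", (salary, employee_id)
    )
    conn.commit()
    return [
        f"warning: test '{name}' failed for ids {ids}"
        for name, ids in run_tests(conn).items()
    ]


def make_demo_db():
    """Create a small in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, salary REAL)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?)", [(1, 52000.0), (2, 61000.0)]
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    # A plausible edit passes every test, so no warnings are produced.
    print(update_salary(conn, 1, 53000.0))
    # A well-formed but likely wrong value (extra zeros) trips the cap test.
    print(update_salary(conn, 2, 6100000.0))
```

In this sketch the update is always applied and the warning is only reported, which mirrors the abstract's point that CDT warns the user rather than rejecting well-formed data. A hard constraint would instead refuse the write.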
Citation:
Kıvanç Muşlu, Yuriy Brun, and Alexandra Meliou, Preventing Data Errors with Continuous Testing, in Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2015, pp. 373–384.
Bibtex:
@inproceedings{Muslu15issta,
  author = {K{\i}van{\c{c}} Mu{\c{s}}lu and Yuriy Brun and Alexandra Meliou},
  title =
  {\href{http://people.cs.umass.edu/brun/pubs/pubs/Muslu15issta.pdf}{Preventing Data Errors with Continuous Testing}},
  booktitle = {Proceedings of the ACM SIGSOFT International Symposium on
  Software Testing and Analysis (ISSTA)},
  venue = {ISSTA},
  month = {July},
  year = {2015},
  date = {12--17},
  pages = {373--384},
  address = {Baltimore, MD, USA},
  doi = {10.1145/2771783.2771792},
  accept = {$\frac{33}{119} \approx 28\%$},

  abstract = {Today, software systems that use data are ubiquitous, and ensuring the data's
  quality is an increasingly important challenge as data errors result in
  annual multi-billion dollar losses. While software debugging and testing have
  received heavy research attention, less effort has been devoted to data
  debugging: identifying system errors caused by well-formed but incorrect
  data. We present continuous data testing (CDT), a low-overhead, delay-free
  technique that quickly identifies likely data errors. CDT continuously
  executes domain-specific test queries; when a test fails, CDT unobtrusively
  warns the user or administrator. We implement CDT in the ConTest prototype
  for the PostgreSQL database management system. A user study with 96 humans
  shows that ConTest is extremely effective at guarding against data entry
  errors: With ConTest, users corrected 98.4\% of their errors, as opposed to
  40.2\% without, even when we injected 40\% false positives into ConTest's
  output. Further, when using ConTest, users corrected data entry errors 3.2
  times faster than when using state-of-the-art methods.},
}