Preventing Data Errors with Continuous Testing
by Kıvanç Muşlu, Yuriy Brun, Alexandra Meliou
Abstract:
Today, software systems that rely on data are ubiquitous, and ensuring the data's quality is an increasingly important challenge as data errors result in annual multi-billion dollar losses. While software debugging and testing have received heavy research attention, less effort has been devoted to data debugging: identifying system errors caused by well-formed but incorrect data. We present continuous data testing (CDT), a low-overhead, delay-free technique that quickly identifies likely data errors. CDT continuously executes domain-specific test queries; when a test fails, CDT unobtrusively warns the user or administrator. We implement CDT in the ConTest prototype for the PostgreSQL database management system. A feasibility user study with 96 humans shows that ConTest was extremely effective in a setting with a data entry application at guarding against data errors: With ConTest, users corrected 98.4% of their errors, as opposed to 40.2% without, even when we injected 40% false positives into ConTest's output. Further, when using ConTest, users corrected data entry errors 3.2 times faster than when using state-of-the-art methods.
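The idea of running test queries and warning on failure can be illustrated with a minimal sketch. It is not ConTest's implementation: it uses Python's built-in sqlite3 module in place of PostgreSQL, re-runs the tests after each write rather than continuously in the background, and the `employees` table, the `salary_in_range` test, and the `run_data_tests` helper are hypothetical names chosen for illustration. Here a test is assumed to fail when its query returns any violating rows.

```python
import sqlite3


def run_data_tests(conn, tests):
    """Run every test query; a test fails if it returns any violating rows."""
    warnings = []
    for name, query in tests.items():
        rows = conn.execute(query).fetchall()
        if rows:
            warnings.append((name, rows))
    return warnings


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")

# Hypothetical domain-specific test: salaries must fall in a plausible range.
tests = {
    "salary_in_range": (
        "SELECT name, salary FROM employees "
        "WHERE salary < 0 OR salary > 1000000"
    ),
}

# A well-formed and plausible row: no test should fail.
conn.execute("INSERT INTO employees VALUES ('Ada', 85000)")
clean = run_data_tests(conn, tests)

# A well-formed but implausible row (an extra digit): the test should fail.
conn.execute("INSERT INTO employees VALUES ('Bob', 8500000)")
flagged = run_data_tests(conn, tests)

for name, rows in flagged:
    print(f"warning: test {name} failed for rows {rows}")
```

The second insert is accepted by the database because the value is well-formed; only the test query marks it as a likely error, which is the kind of warning the abstract describes.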
Citation:
Kıvanç Muşlu, Yuriy Brun, and Alexandra Meliou, Preventing Data Errors with Continuous Testing, in Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2015, pp. 373–384.
Related:
Extended and revised version of "Data Debugging with Continuous Testing" in ESEC-FSE NI 2013.
Bibtex:
@inproceedings{Muslu15issta,
  author = {K{\i}van{\c{c}} Mu{\c{s}}lu and Yuriy Brun and Alexandra Meliou},
  title =
  {\href{http://people.cs.umass.edu/brun/pubs/pubs/Muslu15issta.pdf}{Preventing
  Data Errors with Continuous Testing}},
  booktitle = {Proceedings of the ACM SIGSOFT International Symposium on
  Software Testing and Analysis (ISSTA)},
  venue = {ISSTA},
  month = {July},
  year = {2015},
  date = {12--17},
  pages = {373--384},
  address = {Baltimore, MD, USA},
  doi = {10.1145/2771783.2771792},

  previous = {Extended and revised version of "Data
  Debugging with Continuous Testing" in ESEC-FSE NI 2013.},
  note = {Extended and revised version of~\ref{Muslu13ni-fse}.
  \href{https://doi.org/10.1145/2771783.2771792}{DOI: 10.1145/2771783.2771792}},
  accept = {$\frac{33}{119} \approx 28\%$},

  abstract = {Today, software systems that rely on data are ubiquitous, and
  ensuring the data's quality is an increasingly important challenge as data
  errors result in annual multi-billion dollar losses. While software
  debugging and testing have received heavy research attention, less effort
  has been devoted to data debugging: identifying system errors caused by
  well-formed but incorrect data. We present continuous data testing (CDT), a
  low-overhead, delay-free technique that quickly identifies likely data
  errors. CDT continuously executes domain-specific test queries; when a test
  fails, CDT unobtrusively warns the user or administrator. We implement CDT
  in the ConTest prototype for the PostgreSQL database management system. A
  feasibility user study with 96 humans shows that ConTest was extremely
  effective in a setting with a data entry application at guarding against
  data errors: With ConTest, users corrected 98.4\% of their errors, as
  opposed to 40.2\% without, even when we injected 40\% false positives into
  ConTest's output. Further, when using ConTest, users corrected data entry
  errors 3.2 times faster than when using state-of-the-art methods.},

  fundedBy = {NSF CCF-1349784, NSF IIS-1421322, NSF CCF-1446683, 
  NSF CCF-1453474, Google Inc. via the Faculty Research Award, 
  Microsoft Research via a SEIF award},
}