Behavioral Execution Comparison: Are Tests Representative of Field Behavior?

Wang, Qianqian; Brun, Yuriy; Orso, Alessandro

doi:10.1109/ICST.2017.36

Abstract:

Software testing is the most widely used approach for assessing and improving software quality, but it is inherently incomplete and may not be representative of how the software is used in the field. This paper addresses the questions of to what extent tests represent how real users use software, and how to measure behavioral differences between test and field executions. We study four real-world systems, one used by end-users and three used by other (client) software, and compare test suites written by the systems' developers to field executions using four models of behavior: statement coverage, method coverage, mutation score, and a temporal-invariant-based model we developed. We find that developer-written test suites fail to accurately represent field executions: the tests, on average, miss 6.2% of the statements and 7.7% of the methods exercised in the field; the behavior exercised only in the field kills an extra 8.6% of the mutants; finally, the tests miss 52.6% of the behavioral invariants that occur in the field. In addition, augmenting the in-house test suites with automatically-generated tests only marginally improves the tests' behavioral representativeness. These differences between field and test executions, and in particular the finer-grained and more sophisticated ones that we measured using our invariant-based model, can provide insight for developers and suggest a better method for measuring test suite quality.
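
As a rough illustration of the coverage-style figures quoted above (for example, tests missing 6.2% of the statements exercised in the field), the following sketch computes the share of field-exercised elements that a test suite never exercises. The function name, the set-based representation, and the toy data are assumptions made for illustration and are not taken from the paper, whose exact measurement procedure may differ.

```python
def missed_fraction(field: set, tests: set) -> float:
    """Fraction of field-exercised elements that the tests never exercise.

    Assumption: an "element" is any hashable ID (statement, method, ...),
    and the denominator is the set of elements exercised in the field.
    """
    if not field:
        return 0.0  # nothing exercised in the field, so nothing can be missed
    return len(field - tests) / len(field)


# Hypothetical toy data: IDs of statements covered in each setting.
field_statements = {"s1", "s2", "s3", "s4", "s5"}
test_statements = {"s1", "s2", "s3", "s4", "s9"}

# One of five field statements ("s5") is not covered by the tests.
print(f"{missed_fraction(field_statements, test_statements):.1%}")  # 20.0%
```

The same set difference applies to methods or invariants if each is given a stable identifier; elements exercised only by the tests (such as "s9" here) do not affect the result.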

Citation:

Qianqian Wang, Yuriy Brun, and Alessandro Orso, Behavioral Execution Comparison: Are Tests Representative of Field Behavior?, in Proceedings of the 10th IEEE International Conference on Software Testing, Verification, and Validation (ICST), 2017, pp. 321–332.

Bibtex:

@inproceedings{Wang17icst,
  author = {Qianqian Wang and Yuriy Brun and Alessandro Orso},
  title =
  {\href{http://people.cs.umass.edu/brun/pubs/pubs/Wang17icst.pdf}{Behavioral Execution Comparison:
{Are} Tests Representative of Field Behavior?}},
  booktitle = {Proceedings of the 10th IEEE International Conference on
  Software Testing, Verification, and Validation (ICST)},
  venue = {ICST},
  address = {Tokyo, Japan},
  month = {March},
  date = {13--18},
  year = {2017},
  pages = {321--332},
  doi = {10.1109/ICST.2017.36},
  note = {\href{https://doi.org/10.1109/ICST.2017.36}{DOI:
  10.1109/ICST.2017.36}},

  accept = {$\frac{36}{135} \approx 27\%$},

  abstract = {<p>Software testing is the most widely used approach for
  assessing and improving software quality, but it is inherently incomplete
  and may not be representative of how the software is used in the field.
  This paper addresses the questions of to what extent tests represent how
  real users use software, and how to measure behavioral differences between
  test and field executions. We study four real-world systems, one used by
  end-users and three used by other (client) software, and compare test
  suites written by the systems' developers to field executions using four
  models of behavior: statement coverage, method coverage, mutation score,
  and a temporal-invariant-based model we developed. We find that
  developer-written test suites fail to accurately represent field
  executions: the tests, on average, miss 6.2% of the statements and 7.7% of
  the methods exercised in the field; the behavior exercised only in the
  field kills an extra 8.6% of the mutants; finally, the tests miss 52.6% of
  the behavioral invariants that occur in the field. In addition, augmenting
  the in-house test suites with automatically-generated tests only marginally
  improves the tests' behavioral representativeness. These differences
  between field and test executions, and in particular the finer-grained and
  more sophisticated ones that we measured using our invariant-based model,
  can provide insight for developers and suggest a better method for
  measuring test suite quality.</p>},

  fundedBy = {NSF IIS-1239334, NSF CMMI-1234070, NSF CNS-1513055,
  NSF CCF-1453474},
}