Software testing is the most widely used approach for assessing and improving software quality, but it is inherently incomplete and may not be representative of how the software is used in the field. This paper addresses two questions: to what extent do tests represent how real users use software, and how can behavioral differences between test and field executions be measured? We study four real-world systems, one used by end users and three used by other (client) software, and compare test suites written by the systems' developers to field executions using four models of behavior: statement coverage, method coverage, mutation score, and a temporal-invariant-based model we developed. We find that developer-written test suites fail to accurately represent field executions: the tests, on average, miss 6.2% of the statements and 7.7% of the methods exercised in the field; the behavior exercised only in the field kills an extra 8.6% of the mutants; finally, the tests miss 52.6% of the behavioral invariants that occur in the field. In addition, augmenting the in-house test suites with automatically generated tests only marginally improves the tests' behavioral representativeness. These differences between field and test executions, and in particular the finer-grained and more sophisticated ones that we measured using our invariant-based model, can provide insight for developers and suggest a better method for measuring test suite quality.
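A minimal sketch of the coverage-based comparison the abstract describes (not the paper's actual tooling; the function and input names are hypothetical): given the sets of statements covered by the in-house test suite and by field executions, the "missed" percentage is the fraction of field-covered statements the tests never exercise.

```python
# Hypothetical illustration of the statement-coverage comparison described above.
# `test_covered` and `field_covered` are assumed to be sets of statement IDs
# (e.g., "ClassName:lineNumber") produced by any coverage tool.

def missed_in_field(test_covered: set[str], field_covered: set[str]) -> float:
    """Percentage of field-covered statements that the test suite never executes."""
    if not field_covered:
        return 0.0
    missed = field_covered - test_covered
    return 100.0 * len(missed) / len(field_covered)

# Example: tests cover s1-s3; field runs additionally exercise s4 and s5.
test_covered = {"s1", "s2", "s3"}
field_covered = {"s1", "s2", "s4", "s5"}
print(f"{missed_in_field(test_covered, field_covered):.1f}% of field statements missed")  # 50.0%
```

The same set-difference idea extends to method coverage and to the invariant-based model (replace statement IDs with method names or inferred temporal invariants), which is how the per-model "missed" percentages reported above can be read.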
@inproceedings{Wang17icst,
  author    = {Qianqian Wang and Yuriy Brun and Alessandro Orso},
  title     = {\href{http://people.cs.umass.edu/brun/pubs/pubs/Wang17icst.pdf}{Behavioral Execution Comparison: {Are} Tests Representative of Field Behavior?}},
  booktitle = {Proceedings of the 10th IEEE International Conference on Software Testing, Verification, and Validation (ICST)},
  venue     = {ICST},
  address   = {Tokyo, Japan},
  month     = {March},
  date      = {13--18},
  year      = {2017},
  pages     = {321--332},
  doi       = {10.1109/ICST.2017.36},
  note      = {\href{https://doi.org/10.1109/ICST.2017.36}{DOI: 10.1109/ICST.2017.36}},
  accept    = {$\frac{36}{135} \approx 27\%$},
  abstract  = {<p>Software testing is the most widely used approach for assessing and improving software quality, but it is inherently incomplete and may not be representative of how the software is used in the field. This paper addresses the questions of to what extent tests represent how real users use software, and how to measure behavioral differences between test and field executions. We study four real-world systems, one used by end-users and three used by other (client) software, and compare test suites written by the systems' developers to field executions using four models of behavior: statement coverage, method coverage, mutation score, and a temporal-invariant-based model we developed. We find that developer-written test suites fail to accurately represent field executions: the tests, on average, miss 6.2% of the statements and 7.7% of the methods exercised in the field; the behavior exercised only in the field kills an extra 8.6% of the mutants; finally, the tests miss 52.6% of the behavioral invariants that occur in the field. In addition, augmenting the in-house test suites with automatically-generated tests only marginally improves the tests' behavioral representativeness. These differences between field and test executions, and in particular the finer-grained and more sophisticated ones that we measured using our invariant-based model, can provide insight for developers and suggest a better method for measuring test suite quality.</p>},
  fundedBy  = {NSF IIS-1239334, NSF CMMI-1234070, NSF CNS-1513055, NSF CCF-1453474.},
}