Is the Cure Worse than the Disease? Overfitting in Automated Program Repair
by Edward K. Smith, Earl Barr, Claire Le Goues, and Yuriy Brun
Abstract:

Automated program repair has shown promise for reducing the significant manual effort debugging requires. This paper addresses a deficit of earlier evaluations of automated repair techniques caused by repairing programs and evaluating generated patches' correctness using the same set of tests. Since tests are an imperfect metric of program correctness, evaluations of this type do not discriminate between correct patches and patches that overfit the available tests and break untested but desired functionality. This paper evaluates two well-studied repair tools, GenProg and TrpAutoRepair, on a publicly available benchmark of 998 bugs, each with a human-written patch. By evaluating patches using tests independent from those used during repair, we find that the tools are unlikely to improve the proportion of independent tests passed, and that the quality of the patches is proportional to the coverage of the test suite used during repair. For programs that pass most tests, the tools are as likely to break tests as to fix them. However, novice developers also overfit, and automated repair performs no worse than these developers. In addition to overfitting, we measure the effects of test suite coverage, test suite provenance, and starting program quality, as well as the difference in quality between novice-developer-written and tool-generated patches when quality is assessed with a test suite independent from the one used for patch generation.
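To make the evaluation methodology concrete, the following minimal Python sketch (not the paper's actual harness; the example program, tests, and quality function are illustrative assumptions) shows how a patch generated against one test suite can be scored by the proportion of an independent, held-out test suite it passes, which is how overfitting becomes visible:

# Minimal sketch of held-out evaluation: a patch is produced against one test
# suite, then judged by the fraction of an *independent* suite it passes.
# Everything below is a hypothetical stand-in, not the paper's infrastructure.

from typing import Callable, Iterable

Test = Callable[[Callable[[int], int]], bool]  # a test takes a program and returns pass/fail

def quality(program: Callable[[int], int], held_out_tests: Iterable[Test]) -> float:
    """Proportion of independent (held-out) tests the program passes."""
    tests = list(held_out_tests)
    passed = sum(1 for t in tests if t(program))
    return passed / len(tests) if tests else 0.0

# Hypothetical "overfit" patch: the desired behavior is abs(x), but the patch
# only hard-codes the single failing input seen during repair.
def patched(x: int) -> int:
    return 5 if x == -5 else x

repair_suite = [lambda p: p(3) == 3, lambda p: p(-5) == 5]          # used to generate the patch
independent_suite = [lambda p: p(-1) == 1, lambda p: p(-7) == 7,    # used only for evaluation
                     lambda p: p(2) == 2]

print(quality(patched, repair_suite))       # 1.0: passes every test used during repair
print(quality(patched, independent_suite))  # ~0.33: overfitting exposed by held-out tests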

Citation:
Edward K. Smith, Earl Barr, Claire Le Goues, and Yuriy Brun, Is the Cure Worse than the Disease? Overfitting in Automated Program Repair, in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), 2015, pp. 532–543.
Related:
Previous versions appeared as University of Massachusetts Computer Science technical report UM-CS-2015-007 and as UC Davis College of Engineering technical report \url{https://escholarship.org/uc/item/3z8926ks}.
Bibtex:
@inproceedings{Smith15fse,
  author = {Edward K. Smith and Earl Barr and Claire {Le Goues} and Yuriy Brun},
  title = {\href{http://people.cs.umass.edu/brun/pubs/pubs/Smith15fse.pdf}{Is
  the Cure Worse than the Disease? Overfitting in Automated Program Repair}},
  booktitle = {Proceedings of the 10th Joint Meeting of the European
  Software Engineering Conference and ACM SIGSOFT Symposium on the
  Foundations of Software Engineering (ESEC/FSE)},
  venue = {ESEC/FSE},
  month = {September},
  year = {2015},
  date = {2--4},
  address = {Bergamo, Italy},
  pages = {532--543},
  accept = {$\frac{74}{291} \approx 25\%$},
  doi = {10.1145/2786805.2786825},

  note = {Previous versions appeared as University of Massachusetts Computer
  Science technical report UM-CS-2015-007 and as UC Davis College of Engineering
  technical report \url{https://escholarship.org/uc/item/3z8926ks}.
  \href{https://doi.org/10.1145/2786805.2786825}{DOI: 10.1145/2786805.2786825}},

  previous = {Previous versions appeared as University of Massachusetts Computer
  Science technical report UM-CS-2015-007 and as UC Davis College of Engineering
  technical report \url{https://escholarship.org/uc/item/3z8926ks}.},

  abstract = {<p>Automated program repair has shown promise for reducing the
  significant manual effort debugging requires. This paper addresses a
  deficit of earlier evaluations of automated repair techniques caused by
  repairing programs and evaluating generated patches' correctness using the
  same set of tests. Since tests are an imperfect metric of program
  correctness, evaluations of this type do not discriminate between correct
  patches and patches that overfit the available tests and break untested but
  desired functionality. This paper evaluates two well-studied repair tools,
  GenProg and TrpAutoRepair, on a publicly available benchmark of 998 bugs,
  each with a human-written patch. By evaluating patches using tests
  independent from those used during repair, we find that the tools are
  unlikely to improve the proportion of independent tests passed, and that
  the quality of the patches is proportional to the coverage of the test
  suite used during repair. For programs that pass most tests, the tools are
  as likely to break tests as to fix them. However, novice developers also
  overfit, and automated repair performs no worse than these developers. In
  addition to overfitting, we measure the effects of test suite coverage,
  test suite provenance, and starting program quality, as well as the
  difference in quality between novice-developer-written and tool-generated
  patches when quality is assessed with a test suite independent from the one
  used for patch generation. </p>},

  fundedBy = {NSF CCF-1446683, NSF CCF-1446966, NSF CCF-1453474, 
  Microsoft Research via a SEIF award},
}