Diversity-Driven Automated Formal Verification
by Emily First, Yuriy Brun
Abstract:

Formally verified correctness is one of the most desirable properties of software systems. But despite great progress made via interactive theorem provers, such as Coq, writing proof scripts for verification remains one of the most effort-intensive (and often prohibitively difficult) software development activities. Recent work has created tools that automatically synthesize proofs or proof scripts. For example, CoqHammer can prove 26.6% of theorems completely automatically by reasoning using precomputed facts, while TacTok and ASTactic, which use machine learning to model proof scripts and then perform biased search through the proof-script space, can prove 12.9% and 12.3% of the theorems, respectively. Further, these three tools are highly complementary; together, they can prove 30.4% of the theorems fully automatically. Our key insight is that control over the learning process can produce a diverse set of models, and that, due to the unique nature of proof synthesis (the existence of the theorem prover, an oracle that infallibly judges a proof's correctness), this diversity can significantly improve these tools' proving power. Accordingly, we develop Diva, which uses a diverse set of models with TacTok's and ASTactic's search mechanism to prove 21.7% of the theorems. That is, Diva proves 68% more theorems than TacTok and 77% more than ASTactic. Complementary to CoqHammer, Diva proves 781 theorems (27% added value) that Coq-Hammer does not, and 364 theorems no existing tool has proved automatically. Together with CoqHammer, Diva proves 33.8% of the theorems, the largest fraction to date. We explore nine dimensions for learning diverse models, and identify which dimensions lead to the most useful diversity. Further, we develop an optimization to speed up Diva's execution by 40X. Our study introduces a completely new idea for using diversity in machine learning to improve the power of state-of-the-art proof-script synthesis techniques, and empirically demonstrates that the improvement is significant on a dataset of 68K theorems from 122 open-source software projects.

Citation:
Emily First and Yuriy Brun, Diversity-Driven Automated Formal Verification, in Proceedings of the 44th International Conference on Software Engineering (ICSE), 2022, pp. 749–761 (ACM SIGSOFT Distinguished Paper Award).
Bibtex:
@inproceedings{First22icse,
  author = {Emily First and Yuriy Brun},
  title =
  {\href{http://people.cs.umass.edu/brun/pubs/pubs/First22icse.pdf}{Diversity-Driven Automated Formal Verification}},
  booktitle = {Proceedings of the 44th International Conference on Software Engineering (ICSE)},
  venue = {ICSE},
  address = {Pittsburgh, PA, USA},
  month = {May},
  date = {22--27},
  year = {2022},
  pages = {749--761},
  note = {\raisebox{-.5ex}{\includegraphics[height=2.5ex]{trophy}}~ACM SIGSOFT Distinguished Paper Award. 
  ACM artifact badges granted: 
  \href{https://www.acm.org/publications/policies/artifact-review-and-badging-current}{\raisebox{-.75ex}{\includegraphics[height=2.5ex]{ACMArtifactAvailable}}~Artifact Available}. 
  \href{https://doi.org/10.1145/3510003.3510138}{DOI: 10.1145/3510003.3510138}}, 
  doi = {10.1145/3510003.3510138},
  comment = {<span class="emphasis">ACM SIGSOFT Distinguished Paper Award</span>},

  accept = {$\frac{197}{751} \approx 26\%$},

  abstract = {<p>Formally verified correctness is one of the most desirable properties of
  software systems. But despite great progress made via interactive theorem
  provers, such as Coq, writing proof scripts for verification remains one of
  the most effort-intensive (and often prohibitively difficult) software
  development activities. Recent work has created tools that automatically
  synthesize proofs or proof scripts. For example, CoqHammer can prove 26.6% of
  theorems completely automatically by reasoning using precomputed facts, while
  TacTok and ASTactic, which use machine learning to model proof scripts and
  then perform biased search through the proof-script space, can prove 12.9%
  and 12.3% of the theorems, respectively. Further, these three tools are
  highly complementary; together, they can prove 30.4% of the theorems fully
  automatically. Our key insight is that control over the learning process can
  produce a diverse set of models, and that, due to the unique nature of proof
  synthesis (the existence of the theorem prover, an oracle that infallibly
  judges a proof's correctness), this diversity can significantly improve these
  tools' proving power. Accordingly, we develop Diva, which uses a diverse set
  of models with TacTok's and ASTactic's search mechanism to prove 21.7% of the
  theorems. That is, Diva proves 68% more theorems than TacTok and 77% more
  than ASTactic. Complementary to CoqHammer, Diva proves 781 theorems (27%
  added value) that Coq-Hammer does not, and 364 theorems no existing tool has
  proved automatically. Together with CoqHammer, Diva proves 33.8% of the
  theorems, the largest fraction to date. We explore nine dimensions for
  learning diverse models, and identify which dimensions lead to the most
  useful diversity. Further, we develop an optimization to speed up Diva's
  execution by 40X. Our study introduces a completely new idea for using
  diversity in machine learning to improve the power of state-of-the-art
  proof-script synthesis techniques, and empirically demonstrates that the
  improvement is significant on a dataset of 68K theorems from 122 open-source
  software projects.</p>},

  fundedBy = {NSF CCF-1763423, Amazon},
}