QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning
by Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, Yuriy Brun
Abstract:

Formal verification is a promising method for producing reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. To address this problem, we create QEDCartographer, an automated proof-synthesis tool that combines supervised and reinforcement learning to more effectively explore the proof space. QEDCartographer incorporates the proofs' branching structure, enabling reward-free search and overcoming the sparse reward problem inherent to formal verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124 open-source Coq projects. QEDCartographer fully automatically proves 21.4% of the test-set theorems. Previous search-based proof-synthesis tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva, which combines 62 tools, proves 19.2%. Comparing to the most effective prior tool, Proverbot9001, QEDCartographer produces 26% shorter proofs 27% faster, on average over the theorems both tools prove. Together, QEDCartographer and non-learning-based CoqHammer prove 31.8% of the theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement learning is a fruitful research direction for improving proof-synthesis tools' search mechanisms.

Citation:
Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, and Yuriy Brun, QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning, in Proceedings of the 47th International Conference on Software Engineering (ICSE), 2025.
Bibtex:
@inproceedings{Sanchez-Stern25icse,
  author = {Alex Sanchez-Stern and Abhishek Varghese and Zhanna Kaufman and 
            Dylan Zhang and Talia Ringer and Yuriy Brun},
  title =
  {\href{http://people.cs.umass.edu/brun/pubs/pubs/Sanchez-Stern25icse.pdf}{QEDCartographer: 
         Automating Formal Verification Using Reward-Free Reinforcement Learning}},
  booktitle = {Proceedings of the 47th International Conference on Software Engineering (ICSE)},
  venue = {ICSE},
  address = {Ottowa, ON, Canada},
  month = {April},
  date = {28--30},
  year = {2025},
  accept = {$\frac{46}{523} \approx 8.8\%$ (1st cycle direct accept, without revision)},
  note= {arXiv: \href{https://arxiv.org/abs/2408.09237}{abs/2408.09237}},

  abstract = {<p>Formal verification is a promising method for producing
  reliable software, but the difficulty of manually writing verification
  proofs severely limits its utility in practice. Recent methods have
  automated some proof synthesis by guiding a search through the proof space
  using a theorem prover. Unfortunately, the theorem prover provides only the
  crudest estimate of progress, resulting in effectively undirected search.
  To address this problem, we create QEDCartographer, an automated
  proof-synthesis tool that combines supervised and reinforcement learning to
  more effectively explore the proof space. QEDCartographer incorporates the
  proofs' branching structure, enabling reward-free search and overcoming the
  sparse reward problem inherent to formal verification. We evaluate
  QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124
  open-source Coq projects. QEDCartographer fully automatically proves 21.4%
  of the test-set theorems. Previous search-based proof-synthesis tools Tok,
  Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised
  learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva,
  which combines 62 tools, proves 19.2%. Comparing to the most effective
  prior tool, Proverbot9001, QEDCartographer produces 26% shorter proofs 27%
  faster, on average over the theorems both tools prove. Together,
  QEDCartographer and non-learning-based CoqHammer prove 31.8% of the
  theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that
  reinforcement learning is a fruitful research direction for improving
  proof-synthesis tools' search mechanisms.</p>},
}