ProofCoop: Collaborative Automated Formal Verification (bibtex)

by Zhanna Kaufman, Emily First, Alex Sanchez-Stern, Kyle Thompson, Sorin Lerner and Yuriy Brun

Abstract:

Formal verification using proof assistants, such as Coq or Lean, is an effective way of ensuring software correctness. Recent research has shown that machine learning can automate proof synthesis, but such approaches are only successful a fraction of the time. Methods for improving such proof synthesis include training more precise models, improving the model-driven synthesis search mechanisms, and effectively combining multiple models into synthesis search. This paper focuses on the latter and develops ProofCoop, for the first time demonstrating collaborative cooperation between models. Diva, the state-of-the-art method for combining models, uses a disparate search for each model. On the CoqGym benchmark of 68.5K theorems from 124 open-source Coq projects a Diva combination of five models automatically proves 3,287 (27.5%) of CoqGym's 11.9K test set theorems, while the best individual model proves only 2,604 (21.8%). By contrast, our method, ProofCoop can use the same five models to fully automatically prove 3,932 (33.0%) theorems, meaning that ProofCoop is 19.6% more likely to prove a theorem, on average, than the prior state of the art combination method. ProofCoop allows its models to build on each other's contributions and demonstrates that such collaboration significantly increases synthesis success. ProofCoop enables six different types of collaboration: joint model next-step prediction at each search step; preferential next-step prediction via voting, bidding, and stacking; sharing lemmas proven across models mid-search, and models completing each other's partial proofs. Together with CoqHammer, ProofCoop synthesizes proofs for 36.0% of the theorems. Our research demonstrates that creative uses of learned models can lead to collaborative synthesis that is more effective than prior approaches, suggesting a powerful new research direction in automated formal verification.

View PDF

Citation:

Zhanna Kaufman, Emily First, Alex Sanchez-Stern, Kyle Thompson, Sorin Lerner, and Yuriy Brun, ProofCoop: Collaborative Automated Formal Verification, in Proceedings of the 48th International Conference on Software Engineering (ICSE), 2026, pp. 1339–1352.

Bibtex Entry:

@inproceedings{Kaufman26icse,
  author = {Zhanna Kaufman and Emily First and Alex Sanchez-Stern and 
            Kyle Thompson and Sorin Lerner and Yuriy Brun},
  title =
  {ProofCoop: Collaborative Automated Formal Verification},
  booktitle = {Proceedings of the 48th International Conference on Software Engineering (ICSE)},
  venue = {ICSE},
  address = {Rio de Janeiro, Brazil},
  month = {April},
  date = {15--17},
  year = {2026},
  pages = {1339--1352},
  
  doi = {10.1145/3744916.3787783},
  note = {ACM artifact badges granted: 
  Artifact Available, 
  Artifact Reusable.
  DOI:
  10.1145/3744916.3787783},


  accept = {$\frac{160}{809} \approx 20\%$ (2nd cycle)},

  abstract = {Formal verification using proof assistants, such as Coq or
  Lean, is an effective way of ensuring software correctness. Recent research
  has shown that machine learning can automate proof synthesis, but such
  approaches are only successful a fraction of the time. Methods for
  improving such proof synthesis include training more precise models,
  improving the model-driven synthesis search mechanisms, and effectively
  combining multiple models into synthesis search. This paper focuses on the
  latter and develops ProofCoop, for the first time demonstrating
  collaborative cooperation between models. Diva, the state-of-the-art method
  for combining models, uses a disparate search for each model. On the CoqGym
  benchmark of 68.5K theorems from 124 open-source Coq projects a Diva
  combination of five models automatically proves 3,287 (27.5%) of CoqGym's
  11.9K test set theorems, while the best individual model proves only 2,604
  (21.8%). By contrast, our method, ProofCoop can use the same five models to
  fully automatically prove 3,932 (33.0%) theorems, meaning that ProofCoop is
  19.6% more likely to prove a theorem, on average, than the prior state of
  the art combination method. ProofCoop allows its models to build on each
  other's contributions and demonstrates that such collaboration
  significantly increases synthesis success. ProofCoop enables six different
  types of collaboration: joint model next-step prediction at each search
  step; preferential next-step prediction via voting, bidding, and stacking;
  sharing lemmas proven across models mid-search, and models completing each
  other's partial proofs. Together with CoqHammer, ProofCoop synthesizes
  proofs for 36.0% of the theorems. Our research demonstrates that creative
  uses of learned models can lead to collaborative synthesis that is more
  effective than prior approaches, suggesting a powerful new research
  direction in automated formal verification.},

  fundedBy = {NSF CCF-2210243,
  Defense Advanced Research Projects Agencies (DARPA) under Contract No. HR0011-24-2-0307},
}