Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
Yong Lin · Shange Tang · Bohan Lyu · Ziran Yang · Jui-Hui Chung · Haoyu Zhao · Lai Jiang · Yihan Geng · Jiawei Ge · Jingruo Sun · Jiayun Wu · Jiri Gesi · Ximing Lu · David Acuna · Kaiyu Yang · Hongzhou Lin · Yejin Choi · Danqi Chen · Sanjeev Arora · Chi Jin
Abstract
Automated theorem proving (ATP), the task of generating a proof that passes automated verification given a mathematical problem stated in formal language, is a critical challenge at the intersection of mathematics and Artificial Intelligence (AI). We introduce Goedel-Prover-V2, a family of two language models that establish a new state-of-the-art (SOTA) in open-source ATP using the Lean proof assistant. In addition to standard expert iteration and reinforcement learning, our approach incorporates three key innovations: (1) when improvement on human-written problems plateaus during training, the prover performs {\em scaffolded data synthesis}, generating synthetic problems of increasing difficulty for its own training; (2) the prover is trained to self-correct its proofs using Lean compiler feedback; (3) test-time exploration is improved through checkpoint averaging, which balances accuracy and diversity. Our small model, Goedel-Prover-V2-8B, reaches 84.6\% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B despite being $80\times$ smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1\% on MiniF2F at pass@32 in standard mode and 90.4\% in self-correction mode, outperforming the prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing first place among open-source models and surpassing DeepSeek-Prover-V2-671B's record of 47 problems at pass@1024 with a model about $20\times$ smaller and a significantly lower compute budget. To support community research, we have open-sourced the prover models and a SOTA statement formalizer; all training datasets will be released in the near future.
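As a rough illustration of the checkpoint-averaging idea mentioned above, the sketch below averages model parameters elementwise across saved checkpoints with uniform weights. This is a minimal, hypothetical implementation: the paper does not specify its checkpoint selection or weighting scheme here, so the uniform weighting and the `average_checkpoints` helper are assumptions, not the authors' exact procedure.

```python
import torch

def average_checkpoints(checkpoint_paths):
    """Uniform elementwise average of model weights across checkpoints.

    A minimal sketch of checkpoint averaging as named in the abstract;
    the selection and weighting used by Goedel-Prover-V2 may differ.
    """
    n = len(checkpoint_paths)
    avg_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            # Initialize the running average with the first checkpoint.
            avg_state = {k: v.float() / n for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float() / n
    # Load the result into a model via model.load_state_dict(avg_state).
    return avg_state
```

Averaging nearby checkpoints tends to smooth the sampled output distribution, which is one plausible mechanism for the accuracy/diversity balance the abstract describes.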