Divide and Abstract: Autoformalization via Decomposition and Abstraction Learning
Marcus Min · Yeqi Gao · Wilson Sy · Zhaoyu Li · Xujie Si · Osbert Bastani
Abstract
Autoformalization, the task of translating informal mathematics into formal machine-verifiable languages, has long been challenging, even for individual statements. Beyond the statement level, mathematicians invest substantial effort in formalizing reusable abstractions such as common definitions and relations, based on which a large corpus of statements can be easily formalized. While previous work focuses on fine-tuning models for single statement autoformalization, we introduce $\textit{Divide and Abstract (DNA)}$, an end-to-end framework that not only improves the models' test-time performance at formalizing a corpus of statements, but also learns a library of reusable formal abstractions, which scales to statements outside of the corpus. First, $\textit{DNA}$ extracts common mathematical concepts from the entire informal corpus and formalizes them as reusable abstractions. Conditioned on these learned abstractions, $\textit{DNA}$ decomposes each informal statement in the corpus into a structured collection of informal clauses, translates each clause into its formal correspondents, composes the formal clauses back together, and refines the final formalization given feedback from a symbolic validator. The entire framework requires zero training and thus scales to any formal language, particularly low-resource Domain-Specific Languages (DSL). $\textit{DNA}$ significantly improves performance by up to $\textbf{8.6}\times$, and advances the SOTA by $\textbf{57.8}$\% from $\textbf{40.8}$ to $\textbf{64.4}$.
Successful Page Load