Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving
Abstract
Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, yet they persistently suffer from hallucinations and erroneous logic. While formal theorem proving (FTP) offers process-level reliability, it is limited to verification (checking known propositions). This leaves constructive problem-solving (finding unknown terms that satisfy given conditions) underexplored and disconnected from process-level verifiability. To bridge this gap, we introduce FPS (Formal Problem-Solving), a principled framework that formulates the end-to-end problem-solving process in Lean 4. In FPS, the answer is an unknown metavariable coupled with a proof obligation, forcing it to be mathematically derived and verified. We further present D-FPS (Deductive FPS), which enforces a rigorous chain-of-thought structure, aligning formal derivations with human reasoning steps. To support this direction, we construct three benchmarks by manually refactoring over 1,000 problems: FormalMath500, MiniF2F-Solving, and PutnamBench-Solving. We also propose RPE (Restricted Propositional Equivalence), a symbolic metric that evaluates the semantic correctness of answers rather than relying on brittle string matching. Extensive experiments with state-of-the-art provers reveal that solving is significantly harder than proving, highlighting the "alignment tax" required to transition from loose validity checking to constructive, human-aligned reasoning.
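To make the proving/solving distinction concrete, the sketch below contrasts a verification-style theorem with a solving-style formulation in plain Lean 4. The identifiers (`ftp_example`, `answer`, `fps_example`) are illustrative assumptions, not the paper's actual encoding, which couples the answer to the goal as an unassigned metavariable rather than a predefined term.

```lean
-- Verification (FTP): the answer 4 already appears in the statement,
-- so the prover merely checks a known proposition.
theorem ftp_example : 2 + 2 = 4 := rfl

-- Solving (FPS-style sketch): the answer is a term the solver must supply,
-- and the coupled proof obligation forces it to be correct.
-- In FPS proper, `answer` would begin as an unknown metavariable `?answer`.
def answer : Nat := 4

theorem fps_example : 2 + 2 = answer := rfl
```

An incorrect `answer` (say, 5) would make `fps_example` fail to type-check, illustrating how coupling the unknown to a proof obligation turns answer-finding into a machine-checkable task.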