Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving
Abstract
Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, yet they persistently suffer from hallucinations and erroneous logic. While formal theorem proving (FTP) offers process-level reliability, it is limited to verification (checking known propositions). This leaves constructive problem-solving (finding unknown terms that satisfy given conditions) underexplored and disconnected from process-level verifiability. To bridge this gap, we introduce FPS (Formal Problem-Solving), a principled framework that formulates the end-to-end problem-solving process in Lean 4. In FPS, the answer is an unknown metavariable coupled with a proof obligation, forcing it to be mathematically derived and verified. We further present D-FPS (Deductive FPS), which enforces a rigorous chain-of-thought structure, aligning formal derivations with human reasoning steps. To support this direction, we construct three benchmarks by manually refactoring over 1,000 problems: FormalMath500, MiniF2F-Solving, and PutnamBench-Solving. We also propose RPE (Restricted Propositional Equivalence), a symbolic metric that evaluates the semantic correctness of answers rather than relying on brittle string matching. Extensive experiments with state-of-the-art provers reveal that solving is significantly harder than proving, highlighting the "alignment tax" required to transition from loose validity checking to constructive, human-aligned reasoning.
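To make the proving/solving distinction concrete, the sketch below contrasts a verification-style theorem with a solving-style formulation in plain Lean 4. The identifiers (`ftp_example`, `answer`, `fps_example`) are illustrative assumptions, not the paper's actual encoding, which couples the answer to the goal as an unassigned metavariable rather than a predefined term.

```lean
-- Verification (FTP): the answer 4 already appears in the statement,
-- so the prover merely checks a known proposition.
theorem ftp_example : 2 + 2 = 4 := rfl

-- Solving (FPS-style sketch): the answer is a term the solver must supply,
-- and the coupled proof obligation forces it to be correct.
-- In FPS proper, `answer` would begin as an unknown metavariable `?answer`.
def answer : Nat := 4

theorem fps_example : 2 + 2 = answer := rfl
```

An incorrect `answer` (say, 5) would make `fps_example` fail to type-check, illustrating how coupling the unknown to a proof obligation turns answer-finding into a machine-checkable task.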