Faults in Our Formal Benchmarking: Dataset Defects and Evaluation Failures in Lean Theorem Proving
Abstract
Benchmarks for LLM-assisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel checks only that a proof establishes a \emph{formal} statement; it does not verify that the statement faithfully encodes the intended informal problem, nor that the evaluation harness is robust to trivial or adversarial solutions. We audit widely used Lean theorem-proving benchmarks and find recurring defects in every dataset we examine, including missing hypotheses, simplified problems, incomplete or incorrect translations, and Lean-specific specification hazards. Beyond dataset construction, we catalog evaluation-time failure modes that can inflate reported success rates without demonstrating meaningful proof capability. We propose a fault taxonomy, a suite of automated checkers and prompts, and dataset-release standards to guide the creation of formal mathematics datasets and make evaluation more reproducible and trustworthy.
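As a minimal, hypothetical illustration of the gap between kernel acceptance and statement fidelity (the theorem below is invented, not drawn from any audited benchmark), a mistranslation that introduces contradictory hypotheses is provable regardless of what its goal asserts:

```lean
import Mathlib

-- Hypothetical defective formalization: the hypotheses x < 0 and 0 < x
-- are contradictory, so any goal whatsoever becomes provable. The kernel
-- certifies this "solution" even though it says nothing about whatever
-- informal problem was intended.
theorem defective_translation (x : ℝ) (h₁ : x < 0) (h₂ : 0 < x) :
    x ^ 2 = 37 := by
  exfalso   -- reduce the goal to False
  linarith  -- derive False from the contradictory hypotheses
```

The kernel accepts this proof because it is formally valid; only a fidelity check against the informal problem reveals that nothing of mathematical substance was established.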