FormalRX: Rectify and eXamine Semantic Failures in Autoformalization
Abstract
Autoformalization—translating mathematical problems from natural language into formal proof-assistant code—is essential for rigorous machine reasoning. However, existing evaluation frameworks provide only opaque binary verdicts or scalar scores, offering no interpretable insight into where or why translations fail. This opacity severely limits both human understanding and automated system improvement. To bridge this gap, we introduce \textbf{FormalRX}, a comprehensive diagnostic evaluation framework that transforms autoformalization assessment from black-box judgments into actionable feedback. At its core is the \textbf{\textsc{Sci} Error Taxonomy}, a hierarchical classification scheme that decomposes autoformalization errors into 28 distinct categories under a strict priority ordering. Building on this taxonomy, FormalRX provides four critical diagnostic capabilities: (1) alignment verdicts, (2) error categorization, (3) error localization, and (4) error correction. We train a specialized diagnostic model and evaluate it across all four tasks. Our model, FormalRX-8B, achieves F1-scores of 0.88 (verdict) and 0.71 (categorization), along with accuracies of 0.75 (localization) and 0.73 (correction), substantially outperforming both general-purpose LLMs and specialized baselines. By connecting evaluation with actionable insights, FormalRX enables systematic diagnosis and improvement of autoformalization systems.
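The four diagnostic outputs named in the abstract can be pictured as one structured record per formalization attempt. The sketch below is illustrative only: the field names, the `quantifier-mismatch` taxonomy label, and the helper `is_actionable` are assumptions for exposition, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class DiagnosticReport:
    """One FormalRX-style diagnosis for a single formalization attempt.

    Field names are illustrative assumptions, not the paper's schema.
    """
    aligned: bool                                # (1) alignment verdict
    category: Optional[str] = None               # (2) error category from the taxonomy
    location: Optional[Tuple[int, int]] = None   # (3) offending line span in the formal code
    correction: Optional[str] = None             # (4) suggested corrected formal statement

def is_actionable(report: DiagnosticReport) -> bool:
    """A misaligned report is actionable only if it names, locates, and fixes the error."""
    if report.aligned:
        return False
    return (report.category is not None
            and report.location is not None
            and report.correction is not None)

# Example: a mistranslated quantifier, flagged with category, span, and a proposed fix.
report = DiagnosticReport(
    aligned=False,
    category="quantifier-mismatch",   # hypothetical taxonomy label
    location=(1, 1),
    correction="theorem ex : \u2200 n : \u2115, n + 0 = n := by simp",
)
print(is_actionable(report))
```

A record like this makes each of the four capabilities individually checkable, which is what separates diagnostic feedback from a bare binary verdict.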