FormalRX: Rectify and eXamine Semantic Failures in Autoformalization
Abstract
Autoformalization—translating mathematical problems from natural language into formal proof-assistant code—is essential for rigorous machine reasoning. However, existing evaluation frameworks provide only opaque binary verdicts or scalar scores, offering no interpretable insight into where or why translations fail. This opacity severely limits both human understanding and automated system improvement. To bridge this gap, we introduce \textbf{FormalRX}, a comprehensive diagnostic evaluation framework that transforms autoformalization assessment from black-box judgments into actionable feedback. At its core is the \textbf{\textsc{Sci} Error Taxonomy}, a hierarchical classification scheme that decomposes autoformalization errors into 28 distinct categories under a strict priority ordering. Building on this taxonomy, FormalRX provides four critical diagnostic capabilities: (1) alignment verdicts, (2) error categorization, (3) error localization, and (4) error correction. We train a specialized diagnostic model and evaluate it across all four tasks. Our model, FormalRX-8B, achieves F1-scores of 0.88 (verdict) and 0.71 (categorization), along with accuracies of 0.75 (localization) and 0.73 (correction), substantially outperforming both general-purpose LLMs and specialized baselines. By connecting evaluation with actionable insights, FormalRX enables systematic diagnosis and improvement of autoformalization systems.
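The four diagnostic outputs named in the abstract can be pictured as one structured record per formalization attempt. The sketch below is illustrative only: the field names, the `quantifier-mismatch` taxonomy label, and the helper `is_actionable` are assumptions for exposition, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class DiagnosticReport:
    """One FormalRX-style diagnosis for a single formalization attempt.

    Field names are illustrative assumptions, not the paper's schema.
    """
    aligned: bool                                # (1) alignment verdict
    category: Optional[str] = None               # (2) error category from the taxonomy
    location: Optional[Tuple[int, int]] = None   # (3) offending line span in the formal code
    correction: Optional[str] = None             # (4) suggested corrected formal statement

def is_actionable(report: DiagnosticReport) -> bool:
    """A misaligned report is actionable only if it names, locates, and fixes the error."""
    if report.aligned:
        return False
    return (report.category is not None
            and report.location is not None
            and report.correction is not None)

# Example: a mistranslated quantifier, flagged with category, span, and a proposed fix.
report = DiagnosticReport(
    aligned=False,
    category="quantifier-mismatch",   # hypothetical taxonomy label
    location=(1, 1),
    correction="theorem ex : \u2200 n : \u2115, n + 0 = n := by simp",
)
print(is_actionable(report))
```

A record like this makes each of the four capabilities individually checkable, which is what separates diagnostic feedback from a bare binary verdict.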