QEDBench: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
Santiago Gonzalez ⋅ Alireza Amiribavandpour ⋅ Peter Ye ⋅ Edward Zhang ⋅ Ruslans Aleksejevs ⋅ Todor Antić ⋅ Polina Baron ⋅ Sujeet Bhalerao ⋅ Shubhrajit Bhattacharya ⋅ Zachary Burton ⋅ John Byrne ⋅ Hyungjun Choi ⋅ Nujhat Disha ⋅ Koppány I Encz ⋅ Yuchen Fang ⋅ Robert Joseph George ⋅ Ebrahim Ghorbani ⋅ Alan Goldfarb ⋅ Jing Guo ⋅ Meghal Gupta ⋅ Stefano Huber ⋅ Annika Kanckos ⋅ Minjung Kang ⋅ Hyun Jong Kim ⋅ Dino Lorenzini ⋅ Levi Lorenzo ⋅ Tianyi Mao ⋅ Giovanni Marzenta ⋅ Ariane Masuda ⋅ Lukas Mauth ⋅ Ana Mickovic ⋅ Andrés Miniguano-Trujillo ⋅ Antoine Moulin ⋅ Wenqi Ni ⋅ Tomos Parry ⋅ Kevin Ren ⋅ Hossein Roodbarani ⋅ Mathieu Rundström ⋅ Manjil Saikia ⋅ Detchat Samart ⋅ Rebecca Steiner ⋅ Connor Stewart ⋅ Dhara Thakkar ⋅ Jeffrey Tse ⋅ Vasiliki Velona ⋅ Yunhai Xiang ⋅ Sibel Yalçın ⋅ Jun Yan ⋅ Ji Zeng ⋅ Arman Cohan ⋅ Quanquan Liu
Abstract
As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate and early-graduate-level mathematics. To quantify this gap, we introduce QEDBench, the first benchmark to systematically measure alignment with human experts on university-level mathematical proofs by contrasting course-specific rubrics against expert common-knowledge criteria. Deploying a dual-evaluation matrix ($7$ judges $\times$ $5$ solvers) against more than 1,000 hours of human evaluation, we find that certain frontier evaluators, such as Claude 4.5 Opus, exhibit significant positive bias (up to $+0.28$ mean score inflation), effectively "hallucinating rigor" in flawed proofs. We further uncover a critical reasoning disparity: while Gemini 3.0 Pro achieves state-of-the-art performance ($0.91$ raw score), specialized reasoning models such as o3-deep-research collapse in discrete domains, dropping to $42.1\%$ accuracy in Graph Theory. We release QEDBench as a public benchmark for evaluating and improving AI judges.
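As a rough formalization (our notation here, not necessarily the paper's exact definition), the per-judge score inflation quoted above can be read as the mean signed difference between a judge model's score and the human expert score over the evaluated proofs:

\[
\Delta_{\mathrm{judge}} \;=\; \frac{1}{N} \sum_{i=1}^{N} \bigl( s^{J}_{i} - s^{H}_{i} \bigr),
\]

% Sketch under assumed notation: s^J_i is the judge's score and s^H_i the human
% expert's score for proof i, both on a common scale over N graded proofs.
% A positive value indicates systematic over-scoring, e.g. the reported +0.28
% mean inflation for Claude 4.5 Opus.

Under this reading, a judge aligned with human experts would have $\Delta_{\mathrm{judge}} \approx 0$, while "hallucinating rigor" corresponds to a consistently positive gap on flawed proofs.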