Position: Code Benchmarks Should Prioritize Rigor, Reliability, and Reproducibility
Abstract
Code-related benchmarks play a critical role in evaluating large language models (LLMs), and their quality fundamentally shapes how the community interprets model capabilities. Awareness of benchmark quality has grown in recent years. Yet, after conducting a decade-scale (2014–2025) survey of 572 code benchmarks, we observe a lag between this growing awareness and actual practice. For example, in 2025 alone, the number of benchmarks that ignore code coverage when providing test cases nearly matches the total accumulated across the previous ten years. In response, we take a clear position: Code benchmarks must prioritize rigor in benchmark construction, reliability in evaluation, and reproducibility in release. To operationalize this position, we introduce HOW2BENCH, a guideline for code benchmarks comprising a 55-item checklist. Finally, a further human study reveals that the current issues stem not only from the significant effort required, but also from a lack of awareness of their importance.