Position: Code Benchmarks Should Prioritize Rigor, Reliability, and Reproducibility
Abstract
Code-related benchmarks play a critical role in evaluating large language models (LLMs), and their quality fundamentally shapes how the community interprets model capabilities. Awareness of benchmark quality has grown in recent years. Yet, after conducting a decade-scale (2014–2025) survey of 572 code benchmarks, we observe a lag between this growing awareness and actual practice. For example, in 2025 alone, the number of benchmarks that ignore code coverage when providing test cases nearly matches the total accumulated across the previous ten years. In response, we take a clear position: Code benchmarks must prioritize rigor in benchmark construction, reliability in evaluation, and reproducibility in release. To operationalize this position, we introduce HOW2BENCH, a guideline for code benchmarks comprising a 55-item checklist. Finally, a further human study reveals that the current issues stem not only from the significant effort required, but also from a lack of awareness of their importance.