Toward More Reliable Agent Evaluation: A Component-Based Benchmark Auditing Pipeline
Hyewon Suh ⋅ Seojune Lee ⋅ Binfei Ji ⋅ Rishi Khare ⋅ Basit Khan ⋅ Hyunjun Kim ⋅ Tianyi Zhang ⋅ Venkat Krishna Srinivasan ⋅ Peter Belcak ⋅ Shizhe Diao ⋅ Pavlo Molchanov ⋅ Yingyan (Celine) Lin ⋅ Zhen Dong
Abstract
Reliable evaluation of large language model (LLM) agents depends critically on benchmark validity. However, agent benchmarks are increasingly complex and often contain hidden flaws arising from interactions among user instructions, environments, tools, ground-truth trajectories, and evaluation protocols. These issues confound model errors with benchmark artifacts, undermining leaderboard-based comparisons. Manual auditing does not scale to this setting, while existing automated methods are not designed to systematically capture semantic and contextual issues across interacting benchmark components. We propose **COBA** (**CO**mponent-based **B**enchmark **A**uditing), an automated pipeline for diagnosing and filtering validity issues in agent benchmarks. COBA decomposes agent tasks into four standardized components—User, Environment, Ground Truth, and Evaluation—and operationalizes a component-level issue taxonomy using hybrid rule-based detectors and taxonomy-guided LLM evaluation, augmented with an adversarial rebuttal stage to reduce false positives. Across six widely used agent benchmarks, COBA achieves strong alignment with expert judgments, with F1 scores between 0.791 and 0.874. The pipeline complements manual verification of $\tau^2$-bench by identifying issues missed due to benchmark complexity and generalizes effectively to previously unseen benchmarks with minimal adaptation. Our analysis shows that benchmark flaws are widespread and materially affect agent evaluation outcomes, demonstrating that component-based automated auditing provides a scalable foundation for more reliable and interpretable agent evaluation.
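To make the described workflow concrete, below is a minimal Python sketch of the component-based auditing flow outlined in the abstract: decompose a task into the four components, run cheap rule-based detectors alongside a taxonomy-guided LLM check, then pass candidate flags through an adversarial rebuttal filter. All names (`AgentTask`, `rule_detectors`, `llm_flag_issues`, `rebuttal_filter`) and the specific checks are hypothetical illustrations under assumed interfaces, not the authors' implementation.

```python
# Illustrative sketch only; class and function names are hypothetical,
# not the COBA authors' actual code.
from dataclasses import dataclass


@dataclass
class AgentTask:
    """An agent benchmark task split into the four standardized components."""
    user: str           # user instruction
    environment: dict   # environment / tool state
    ground_truth: list  # reference trajectory
    evaluation: dict    # evaluation protocol / checker configuration


def rule_detectors(task: AgentTask) -> list:
    """Cheap rule-based checks for structural issues (example rules only)."""
    issues = []
    if not task.user.strip():
        issues.append("User: missing instruction")
    if not task.ground_truth:
        issues.append("GroundTruth: empty reference trajectory")
    return issues


def llm_flag_issues(task: AgentTask, taxonomy: list) -> list:
    """Taxonomy-guided LLM review of each component (stubbed here).

    In practice this would prompt an LLM with the component contents and the
    issue taxonomy; it is stubbed so the sketch stays runnable.
    """
    return []


def rebuttal_filter(task: AgentTask, candidates: list) -> list:
    """Adversarial rebuttal stage: only issues that survive a counter-argument
    pass through (stubbed as a pass-through in this sketch)."""
    return candidates


def audit(task: AgentTask, taxonomy: list) -> list:
    """Combine rule-based and LLM-based detection, then filter false positives."""
    candidates = rule_detectors(task) + llm_flag_issues(task, taxonomy)
    return rebuttal_filter(task, candidates)


if __name__ == "__main__":
    task = AgentTask(user="Book a flight to Paris",
                     environment={}, ground_truth=[], evaluation={})
    print(audit(task, taxonomy=["ambiguous instruction", "unreachable state"]))
```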