TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation
Steven Liu ⋅ Jane Luo ⋅ Xin Zhang ⋅ Aofan Liu ⋅ Hao Liu ⋅ Jie Wu ⋅ Ziyang Huang ⋅ Yangyu Huang ⋅ Yu Kang ⋅ Scarlett Li
Abstract
As Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either constrain models to a compliance trap by treating existing code as the ground truth for regression prevention, or rely on post-failure artifacts (e.g., issue reports) for reactive bug reproduction, and thus fail to expose defects before they manifest as failures. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. Comprising 2,389 tasks across 482 repositories, TestExplora conceals all defect-related information, forcing models to uncover bugs by identifying discrepancies between the implementation and documentation-derived intent, with documentation serving as the reference oracle. Furthermore, to ensure sustainable evaluation and mitigate the risk of data leakage in static datasets, we propose a continuous, time-aware data collection framework. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass ($F2P$) rate of only 16.06%. Further analysis indicates that navigating complex cross-module interactions and leveraging agentic exploration are critical to advancing LLMs toward autonomous software quality assurance. Consistent with this, SWEAgent instantiated with GPT-5-mini achieves an $F2P$ of 17.27% and an $F2P@5$ of 29.7%, highlighting the effectiveness and promise of agentic exploration in proactive bug discovery tasks.
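The abstract does not define its metrics formally; in comparable benchmarks, Fail-to-Pass is typically the fraction of tasks for which a generated test fails on the defective code and passes on the corrected code, and $F2P@k$ counts a task as solved if at least one of $k$ attempts meets that criterion. The sketch below illustrates that interpretation only; the `AttemptResult` schema and field names are hypothetical and not taken from TestExplora.

```python
from dataclasses import dataclass

@dataclass
class AttemptResult:
    """Outcome of one generated test attempt for a task (hypothetical schema)."""
    fails_on_buggy: bool   # the test fails when run against the defective code
    passes_on_fixed: bool  # the same test passes once the defect is repaired

def is_fail_to_pass(attempt: AttemptResult) -> bool:
    """An attempt counts as Fail-to-Pass if it exposes the bug and validates the fix."""
    return attempt.fails_on_buggy and attempt.passes_on_fixed

def f2p_at_k(results: dict[str, list[AttemptResult]], k: int = 1) -> float:
    """Fraction of tasks with at least one Fail-to-Pass attempt among the first k."""
    if not results:
        return 0.0
    solved = sum(
        any(is_fail_to_pass(a) for a in attempts[:k])
        for attempts in results.values()
    )
    return solved / len(results)

# Example usage: F2P corresponds to k=1, F2P@5 to k=5, over per-task attempt lists.
# f2p, f2p_at_5 = f2p_at_k(results, k=1), f2p_at_k(results, k=5)
```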