DevEvol: Benchmarking LLM Agents on Continuous Software Evolution
Gangda Deng ⋅ Zhaoling Chen ⋅ Zhongming Yu ⋅ Haoyang Fan ⋅ Yuhong Liu ⋅ Yuxin Yang ⋅ Dhruv Parikh ⋅ Rajgopal Kannan ⋅ Le Cong ⋅ Mengdi Wang ⋅ Qian Zhang ⋅ Viktor Prasanna ⋅ Robert Tang ⋅ Xingyao Wang
Abstract
Large Language Model (LLM) agents have demonstrated remarkable proficiency in solving isolated software engineering tasks. However, existing benchmarks predominantly evaluate static, independent issues, failing to reflect the continuous and sequentially dependent nature of real-world software evolution. We introduce DeepCommit, an automated pipeline that reconstructs verifiable software evolution trajectories from git histories as Milestone DAGs, and DevEvol, a benchmark for streaming evaluation over evolving codebases. This setting requires agents to manage long-term context, architectural consistency, and technical debt. Our evaluation reveals a fundamental performance gap: even frontier models achieve only ~35% Score and ~10% Resolve Rate in continuous environments, driven by a "snowball effect" where early errors accumulate and block downstream development. These results demonstrate that strong snapshot performance substantially overestimates real-world agent capability, establishing long-horizon software evolution as a critical unsolved challenge. Our code and dataset are available at https://anonymous.4open.science/r/DevEvol-48A8.
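To make the evaluation setting concrete, the sketch below shows one way a git history could be organized as a Milestone DAG and replayed as a stream of sequentially dependent tasks. This is an illustrative assumption, not the paper's published API: the names `Milestone`, `stream_evaluate`, `run_agent`, and `run_tests` are hypothetical, and the scoring is a simplified stand-in for the benchmark's actual Score and Resolve Rate metrics.

```python
# Hypothetical sketch of streaming evaluation over a Milestone DAG.
# All names and the scoring scheme are illustrative assumptions.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Milestone:
    """One verifiable development step reconstructed from the git history."""
    commit_sha: str
    spec: str                                         # natural-language task description
    tests: list[str]                                  # commands that verify this milestone
    parents: list[str] = field(default_factory=list)  # prerequisite milestone ids


def stream_evaluate(milestones: dict[str, Milestone], run_agent, run_tests):
    """Replay milestones in dependency order on a single evolving workspace.

    Each milestone is attempted on top of the agent's own earlier edits, so an
    early failure can block or degrade downstream milestones -- the "snowball
    effect" described in the abstract.
    """
    order = TopologicalSorter(
        {mid: m.parents for mid, m in milestones.items()}
    ).static_order()

    scores, resolved = [], 0
    for mid in order:
        m = milestones[mid]
        run_agent(m.spec)                              # agent edits the shared workspace
        passed = sum(bool(run_tests(t)) for t in m.tests)
        scores.append(passed / max(len(m.tests), 1))   # partial credit per milestone
        resolved += passed == len(m.tests)             # fully verified milestones

    return {"score": sum(scores) / len(scores), "resolve_rate": resolved / len(scores)}
```

Under this reading, a "snapshot" benchmark would reset the workspace before every milestone, whereas the streaming setting keeps the agent's accumulated code, which is what distinguishes DevEvol's continuous evaluation.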