NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Abstract
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks primarily evaluate short-horizon behaviors such as localized code generation, scaffolded completion, or repository repair, leaving it unclear whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we introduce NL2Repo-Bench, a benchmark explicitly designed to evaluate long-horizon repository generation from scratch: given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, and produce a fully installable Python library. Experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved, with even the strongest agents achieving an average test pass rate of merely 40\% and rarely completing an entire repository correctly. Further analysis identifies systematic long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. These results position NL2Repo-Bench as a rigorous, execution-based testbed for evaluating sustained agentic competence and highlight long-horizon reasoning as a key bottleneck for autonomous coding agents. Our data and code are available at https://anonymous.4open.science/r/nl2repobench-foricml-F4ED/.