NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Abstract
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks primarily evaluate short-horizon behaviors such as localized code generation, scaffolded completion, or repository repair, leaving it unclear whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we introduce NL2Repo-Bench, a benchmark explicitly designed to evaluate long-horizon repository generation from scratch: given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, and produce a fully installable Python library. Experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved, with even the strongest agents achieving an average test pass rate of merely 40\% and rarely completing an entire repository correctly. Further analysis identifies systematic long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. These results position NL2Repo-Bench as a rigorous, execution-based testbed for evaluating sustained agentic competence and highlight long-horizon reasoning as a key bottleneck for autonomous coding agents. Our data and code are available at https://anonymous.4open.science/r/nl2repobench-foricml-F4ED/.