CodeClash: Benchmarking Goal-Oriented Software Engineering
Abstract
Existing coding benchmarks evaluate language models (LMs) on concrete, well-specified tasks such as fixing bugs or writing tests. However, human programmers do not spend all day addressing isolated GitHub issues. Instead, real-world software development is grounded in the pursuit of high-level goals. Evaluating whether LMs can iteratively develop code to accomplish open-ended objectives without explicit guidance remains an open challenge. We introduce CodeClash, a benchmark in which LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents first edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives such as score maximization, resource acquisition, or survival. Models must decide for themselves how to improve their code, both in absolute terms and relative to their opponents. We run 1,680 tournaments to evaluate 8 LMs across 6 arenas, revealing that models exhibit diverse development styles yet share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance: repositories grow progressively messy and redundant. Top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.
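To make the tournament structure concrete, the sketch below illustrates the two-phase round loop described above: each round, every agent edits its codebase, then the codebases compete in the arena and the winner is recorded. All names here (Competitor, edit_codebase, run_arena, and their signatures) are illustrative assumptions for exposition, not the actual CodeClash API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Competitor:
    """One LM agent and the codebase it maintains across rounds (hypothetical)."""
    name: str
    codebase: dict = field(default_factory=dict)  # path -> file contents
    wins: int = 0


def run_tournament(
    competitors: list[Competitor],
    edit_codebase: Callable[[Competitor, list[dict]], dict],
    run_arena: Callable[[list[Competitor]], Competitor],
    num_rounds: int = 10,
) -> list[Competitor]:
    """Play a multi-round tournament: agents edit, then codebases compete."""
    arena_logs: list[dict] = []
    for round_idx in range(num_rounds):
        # Phase 1: each agent revises its own code, seeing prior arena outcomes.
        for agent in competitors:
            agent.codebase = edit_codebase(agent, arena_logs)
        # Phase 2: codebases compete head-to-head; the arena picks a winner
        # according to its objective (score, resources, survival, ...).
        winner = run_arena(competitors)
        winner.wins += 1
        arena_logs.append({"round": round_idx, "winner": winner.name})
    # Rank competitors by rounds won across the tournament.
    return sorted(competitors, key=lambda c: c.wins, reverse=True)
```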