GameDevBench: Evaluating Agentic Capabilities Through Game Development
Wayne Chi ⋅ Yixiong Fang ⋅ Arnav Yayavaram ⋅ Siddharth Yayavaram ⋅ Seth Karten ⋅ Qiuhong Anna Wei ⋅ Runkun Chen ⋅ Alexander Wang ⋅ Valerie Chen ⋅ Ameet Talwalkar ⋅ Chris Donahue
Abstract
While coding agents have advanced rapidly, progress on multimodal agents has lagged behind, largely due to the gap between the unimodal nature of code and the multimodal nature of most other computer applications. Game development bridges this gap: it mirrors software development's complexity in terms of large codebases and contextual demands, while simultaneously requiring multimodal understanding. We present GameDevBench, the first benchmark for evaluating agents on game development tasks, consisting of 168 tasks derived from web and video tutorials. Tasks demand significant multimodal understanding and are complex: the average solution requires more than three times as many changes as solutions in existing software development benchmarks. Agents still struggle with game development, with the best agent solving only $50.0$\% of tasks. We further introduce two simple image- and video-based feedback methods that nearly double performance in one setting, from $25.6$\% to $44.4$\%. We also find that performance degrades sharply with multimodal complexity, dropping on average from $44.4$\% pass@1 on gameplay-oriented tasks to $24.3$\% on graphics tasks.