Anchoring Self-Play for Code Repair
Caroline Choi ⋅ Zeyneb Kaya ⋅ Shirley Wu ⋅ Tengyu Ma ⋅ Tatsunori Hashimoto ⋅ Ludwig Schmidt
Abstract
Code repair is an important capability for language models (LMs): given a buggy program and unit tests, an LM must produce a fixed program that passes the tests. We aim to scale supervision for code repair by having an LM generate bug–fix tasks with unconstrained edits, using unit tests as the only verifier. We propose generator-fixer self-play, in which a single model is trained with reinforcement learning to alternate between generating bugs and fixing them. As the fixer improves, the generator adapts to produce increasingly difficult bugs, yielding an automatic curriculum. However, because unit tests certify correctness but not realism, we find that the generator can drift away from bugs encountered in practice, improving repair on self-generated bugs while degrading on real-world bugs. We propose Anchored Self-Play (ASP), which anchors self-play to a small reference set by (i) adding a code-embedding similarity reward to guide generation and (ii) mixing reference bugs into fixer training to prevent drift. To reflect LM-assisted programming, where bugs come from humans, LMs, and human edits of LM code, we introduce BugSourceBench, a code repair benchmark spanning human-authored bugs, human-edited buggy LM code, and errors in LM-generated code. Across bug sources, ASP achieves the best fix rates, improving average fix rate by $+25\%$ (relative) / $+7.2$ pp (absolute) over standard self-play, with gains on both LM-error bugs ($+100\%$ relative / $+11$ pp absolute) and human-authored bugs ($+7.1\%$ relative / $+3.4$ pp absolute).
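To make the two anchoring components concrete, the sketch below illustrates one possible reading of (i) a generator reward that combines unit-test validity with code-embedding similarity to a reference set and (ii) mixing reference bugs into fixer training batches. All names and values here (cosine_sim, generator_reward, build_fixer_batch, alpha, mix_ratio) are illustrative assumptions, not the paper's implementation; the abstract only specifies that generation is rewarded for similarity to reference bugs and that reference bugs are mixed into fixer training.

```python
# Minimal sketch of the two ASP anchoring components (illustrative only;
# alpha and mix_ratio are assumed knobs, not values from the paper).
import random
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two code-embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def generator_reward(bug_fails_tests: bool,
                     bug_embedding: np.ndarray,
                     reference_embeddings: list[np.ndarray],
                     alpha: float = 0.5) -> float:
    # Validity term: the edited program must actually fail the unit tests,
    # i.e., the tests (the only verifier) certify that a real bug was introduced.
    validity = 1.0 if bug_fails_tests else 0.0
    # Anchoring term: similarity of the generated bug to its closest reference bug,
    # discouraging drift toward unrealistic, self-play-only bug styles.
    similarity = max(cosine_sim(bug_embedding, r) for r in reference_embeddings)
    return validity + alpha * similarity

def build_fixer_batch(self_play_bugs: list, reference_bugs: list,
                      mix_ratio: float = 0.25, batch_size: int = 64) -> list:
    # Mix a fixed fraction of reference bugs into each fixer training batch
    # so the fixer keeps seeing realistic bugs alongside self-generated ones.
    n_ref = min(int(mix_ratio * batch_size), len(reference_bugs))
    batch = random.sample(reference_bugs, n_ref)
    batch += random.sample(self_play_bugs, batch_size - n_ref)
    random.shuffle(batch)
    return batch
```

Under this reading, alpha trades off test-certified validity against realism of the generated bugs, and mix_ratio controls how strongly fixer training is anchored to the reference distribution.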