Spotlight Poster
Multi-Turn Code Generation Through Single-Step Rewards
Arnav Kumar Jain · Gonzalo Gonzalez-Pumariega · Wayne Chen · Alexander Rush · Wenting Zhao · Sanjiban Choudhury
East Exhibition Hall A-B #E-2600
Abstract:
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$CODE, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$CODE iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide an analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$CODE at utilizing execution feedback.
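The loop below is a minimal, hypothetical sketch of the kind of generator/verifier data collection the abstract describes: roll out the generator over several turns of execution feedback, let the verifier rank candidates, and treat any candidate that passes the tests as a single-step imitation target. The `generate`, `run_tests`, and `verifier_score` functions are illustrative stand-ins rather than the released $\mu$CODE implementation, and the actual parameter updates (fine-tuning the generator, training the verifier on the ranked rollouts) are omitted.

```python
import random

def generate(prompt, history, k=4):
    """Stand-in for a code LLM: returns k candidate programs conditioned on
    the problem and the prior (code, execution feedback) turns."""
    return [f"# candidate {len(history)}.{i} for: {prompt}" for i in range(k)]

def run_tests(program, tests):
    """Stand-in for executing unit tests; returns (passed, feedback)."""
    passed = random.random() < 0.3
    return passed, "all tests passed" if passed else "assertion failed"

def verifier_score(prompt, program):
    """Stand-in learned verifier: higher score means more likely correct."""
    return random.random()

def collect_training_data(problems, turns=3, k=4):
    """One data-collection pass. At every turn the verifier picks the best
    candidate; candidates that pass the tests become single-step targets
    for imitation learning, and the scored rollouts supervise the verifier."""
    imitation_data, verifier_data = [], []
    for prompt, tests in problems:
        history = []
        for _ in range(turns):
            candidates = generate(prompt, history, k)
            scored = [(verifier_score(prompt, c), c) for c in candidates]
            outcomes = [(run_tests(c, tests), s, c) for s, c in scored]
            verifier_data.append((prompt, list(history), outcomes))
            (passed, feedback), _, best_code = max(
                outcomes, key=lambda o: (o[0][0], o[1]))
            if passed:
                # One-step recoverability: this correct program is a valid
                # target from any intermediate state in the history.
                imitation_data.append((prompt, list(history), best_code))
                break
            history.append((best_code, feedback))
    return imitation_data, verifier_data

if __name__ == "__main__":
    toy = [("reverse a string", ["assert rev('ab') == 'ba'"])]
    sft, ver = collect_training_data(toy)
    print(len(sft), "imitation examples;", len(ver), "verifier examples")
```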
Lay Summary:
We want agents that can generate correct code for us, but doing so in one try is difficult without unit-test feedback, so we focus on multi-turn code generation, where an agent iteratively refines its solution using execution feedback. However, training agents on such correct/incorrect feedback with reinforcement learning is challenging: the reward signal is sparse, which makes learning inefficient.

Our work introduces $\mu$Code, a simple and scalable approach that makes this process more effective. First, we observe that a correct code solution can be generated at any step, meaning the agent can "recover" in a single step; we call this *one-step recoverability*. Second, instead of relying on sparse rewards, we *learn a verifier* that provides a richer score and makes learning easier. Together, these insights reduce the problem from complex reinforcement learning to imitation learning, making training more stable. In addition, the learned verifier lets us generate multiple solutions at inference time and choose the highest-scoring one in a *multi-turn Best-of-N search*.

We release our models for generating and verifying code so that researchers can contribute to the self-improving model community. Developing stronger generators and verifiers in conjunction will produce agents that are more capable and more reliable at code generation over multiple steps.
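As a concrete illustration of the inference procedure described above, the following is a minimal, hypothetical multi-turn Best-of-N search: at each turn, sample N candidate programs, keep the one the learned verifier scores highest, run it to obtain execution feedback, and continue from that feedback. Here `generate`, `execute`, and `verifier_score` are assumed stand-ins, not the released models.

```python
import random

def generate(prompt, history, n):
    """Stand-in generator: sample n candidate programs conditioned on the
    problem and prior (code, execution feedback) turns."""
    return [f"# attempt {len(history)}.{i} for: {prompt}" for i in range(n)]

def execute(program, public_tests):
    """Stand-in executor: returns (passed, feedback string)."""
    passed = random.random() < 0.25
    return passed, "ok" if passed else "failed public tests"

def verifier_score(prompt, program):
    """Stand-in learned verifier: scores how likely the program is correct."""
    return random.random()

def multi_turn_best_of_n(prompt, public_tests, n=8, turns=4):
    """Each turn: sample n candidates, keep the verifier's top choice,
    feed its execution feedback back into the context, and repeat.
    Returns the highest-scoring program found."""
    history = []
    best_overall, best_score = None, float("-inf")
    for _ in range(turns):
        candidates = generate(prompt, history, n)
        score, best = max((verifier_score(prompt, c), c) for c in candidates)
        passed, feedback = execute(best, public_tests)
        if score > best_score:
            best_overall, best_score = best, score
        if passed:
            return best
        history.append((best, feedback))
    return best_overall

if __name__ == "__main__":
    print(multi_turn_best_of_n("reverse a string", ["assert rev('ab') == 'ba'"]))
```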