Spotlight Poster
Multi-Turn Code Generation Through Single-Step Rewards
Arnav Kumar Jain · Gonzalo Gonzalez-Pumariega · Wayne Chen · Alexander Rush · Wenting Zhao · Sanjiban Choudhury
East Exhibition Hall A-B #E-2600
Abstract:
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$CODE, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$CODE iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide an analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$CODE at utilizing execution feedback.
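The loop below is a minimal, hypothetical sketch of the kind of generator/verifier data collection the abstract describes: roll out the generator over several turns of execution feedback, let the verifier rank candidates, and treat any candidate that passes the tests as a single-step imitation target. The `generate`, `run_tests`, and `verifier_score` functions are illustrative stand-ins rather than the released $\mu$CODE implementation, and the actual parameter updates (fine-tuning the generator, training the verifier on the ranked rollouts) are omitted.

```python
import random

def generate(prompt, history, k=4):
    """Stand-in for a code LLM: returns k candidate programs conditioned on
    the problem and the prior (code, execution feedback) turns."""
    return [f"# candidate {len(history)}.{i} for: {prompt}" for i in range(k)]

def run_tests(program, tests):
    """Stand-in for executing unit tests; returns (passed, feedback)."""
    passed = random.random() < 0.3
    return passed, "all tests passed" if passed else "assertion failed"

def verifier_score(prompt, program):
    """Stand-in learned verifier: higher score means more likely correct."""
    return random.random()

def collect_training_data(problems, turns=3, k=4):
    """One data-collection pass. At every turn the verifier picks the best
    candidate; candidates that pass the tests become single-step targets
    for imitation learning, and the scored rollouts supervise the verifier."""
    imitation_data, verifier_data = [], []
    for prompt, tests in problems:
        history = []
        for _ in range(turns):
            candidates = generate(prompt, history, k)
            scored = [(verifier_score(prompt, c), c) for c in candidates]
            outcomes = [(run_tests(c, tests), s, c) for s, c in scored]
            verifier_data.append((prompt, list(history), outcomes))
            (passed, feedback), _, best_code = max(
                outcomes, key=lambda o: (o[0][0], o[1]))
            if passed:
                # One-step recoverability: this correct program is a valid
                # target from any intermediate state in the history.
                imitation_data.append((prompt, list(history), best_code))
                break
            history.append((best_code, feedback))
    return imitation_data, verifier_data

if __name__ == "__main__":
    toy = [("reverse a string", ["assert rev('ab') == 'ba'"])]
    sft, ver = collect_training_data(toy)
    print(len(sft), "imitation examples;", len(ver), "verifier examples")
```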
Lay Summary:
We want agents that can generate correct code for us, but doing so in one try is difficult without unit-test feedback, so we focus on multi-turn code generation, where an agent iteratively refines its solution using execution feedback. However, training agents on such correct/incorrect feedback with reinforcement learning is challenging: the reward signal is sparse, which makes learning inefficient.

Our work introduces $\mu$Code, a simple and scalable approach that makes this process more effective. First, we observe that a correct code solution can be generated at any step, meaning the agent can "recover" in a single step; we call this *one-step recoverability*. Second, instead of relying on sparse rewards, we *learn a verifier* that provides a richer score and makes learning easier. Together, these insights reduce the problem from complex reinforcement learning to imitation learning, making training more stable. In addition, the learned verifier lets us generate multiple solutions at inference time and choose the highest-scoring one in a *multi-turn Best-of-N search*.

We release our models for generating and verifying code so that researchers can contribute to the self-improving model community. Developing stronger generators and verifiers in conjunction will produce agents that are more capable and more reliable at code generation over multiple steps.
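As a concrete illustration of the inference procedure described above, the following is a minimal, hypothetical multi-turn Best-of-N search: at each turn, sample N candidate programs, keep the one the learned verifier scores highest, run it to obtain execution feedback, and continue from that feedback. Here `generate`, `execute`, and `verifier_score` are assumed stand-ins, not the released models.

```python
import random

def generate(prompt, history, n):
    """Stand-in generator: sample n candidate programs conditioned on the
    problem and prior (code, execution feedback) turns."""
    return [f"# attempt {len(history)}.{i} for: {prompt}" for i in range(n)]

def execute(program, public_tests):
    """Stand-in executor: returns (passed, feedback string)."""
    passed = random.random() < 0.25
    return passed, "ok" if passed else "failed public tests"

def verifier_score(prompt, program):
    """Stand-in learned verifier: scores how likely the program is correct."""
    return random.random()

def multi_turn_best_of_n(prompt, public_tests, n=8, turns=4):
    """Each turn: sample n candidates, keep the verifier's top choice,
    feed its execution feedback back into the context, and repeat.
    Returns the highest-scoring program found."""
    history = []
    best_overall, best_score = None, float("-inf")
    for _ in range(turns):
        candidates = generate(prompt, history, n)
        score, best = max((verifier_score(prompt, c), c) for c in candidates)
        passed, feedback = execute(best, public_tests)
        if score > best_score:
            best_overall, best_score = best, score
        if passed:
            return best
        history.append((best, feedback))
    return best_overall

if __name__ == "__main__":
    print(multi_turn_best_of_n("reverse a string", ["assert rev('ab') == 'ba'"]))
```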