Poster Mon, Jul 6, 2026 • 6:30 PM – 8:15 PM PDT HALL A #125

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

Hao Wang ⋅ Lei Sha ⋅ Jie Zhang

Abstract

Existing code reasoning methods primarily supervise final code outputs and ignore intermediate states, which often leads to reward hacking, where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit supervision of intermediate execution states. By automatically inserting structured, print-based execution-trace anchors into code, we train the model to predict the runtime state at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Dual-Granularity GRPO (DG-GRPO), a reinforcement learning algorithm that performs structured credit assignment at two complementary granularities: inter-trajectory comparison across sampled execution paths and intra-trajectory shaping based on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves state-of-the-art performance in code reasoning. In particular, our 7B model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9%, outperforming the CodeReasoner-7B baseline (72.3%), its 14B counterpart (81.1%), and GPT-4o (77.3%). In addition, StepCodeReasoner improves code generation, achieving 90.1 on HumanEval, 85.0 on MBPP, and 19.4 on LiveCodeBench generation, for an average score of 64.8 versus 62.6 for CodeReasoner-7B.
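
The idea of print-based execution-trace anchors can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `[TRACE k]` anchor format, the helper names `insert_anchors`, `run_trace`, and `stepwise_match`, and the 0/1 per-step score are all assumptions made for illustration. The sketch inserts a print statement after each simple assignment, executes the instrumented program to collect the ground-truth trace, and compares a predicted trace against it step by step.

```python
import ast
import contextlib
import io


def insert_anchors(source: str) -> str:
    """Insert a print anchor after each assignment to a plain variable name.

    The "[TRACE k]" format is a hypothetical choice, not taken from the paper.
    Requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)
    counter = 0

    def instrument(stmts):
        nonlocal counter
        out = []
        for stmt in stmts:
            # Recurse into nested blocks (loops, conditionals, function bodies).
            for field in ("body", "orelse"):
                inner = getattr(stmt, field, None)
                if isinstance(inner, list) and inner:
                    setattr(stmt, field, instrument(inner))
            out.append(stmt)
            if isinstance(stmt, ast.Assign):
                targets = stmt.targets
            elif isinstance(stmt, ast.AugAssign):
                targets = [stmt.target]
            else:
                targets = []
            for target in targets:
                if isinstance(target, ast.Name):
                    counter += 1
                    anchor_src = (
                        f'print("[TRACE {counter}] {target.id} =", '
                        f"repr({target.id}))"
                    )
                    out.append(ast.parse(anchor_src).body[0])
        return out

    tree.body = instrument(tree.body)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def run_trace(instrumented: str) -> list:
    """Execute the instrumented code and return the emitted anchor lines."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(instrumented, {})  # only run trusted code this way
    return [ln for ln in buffer.getvalue().splitlines() if ln.startswith("[TRACE")]


def stepwise_match(predicted: list, actual: list) -> list:
    """Assumed per-step score: 1.0 if the predicted state matches, else 0.0.

    Steps the prediction omits are scored 0.0.
    """
    return [
        1.0 if i < len(predicted) and predicted[i] == state else 0.0
        for i, state in enumerate(actual)
    ]


SOURCE = """
total = 0
for i in range(3):
    total += i
"""

if __name__ == "__main__":
    for line in run_trace(insert_anchors(SOURCE)):
        print(line)
```

In this sketch an anchor's index identifies its position in the source rather than the execution step, so an anchor inside a loop appears once per iteration in the trace. A per-step score of this kind is one possible way to obtain verifiable intermediate signals; how StepCodeReasoner turns such signals into rewards for DG-GRPO is not specified in the abstract.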

Lay Summary

This paper studies how to make AI systems better at understanding and writing computer programs. Many current models are trained mainly to produce the final answer, but this can hide mistakes in the reasoning process: a model may sometimes get the right output while following an incorrect chain of logic. We propose a method called StepCodeReasoner, which teaches a model to track what happens inside a program while it runs. Instead of only checking the final answer, we add simple checkpoints that reveal important intermediate values during execution. The model is then trained to predict these step-by-step states, making its reasoning more grounded in how the program actually behaves. Our experiments show that this approach improves performance on several code reasoning benchmarks and also helps with code generation tasks. More broadly, the results suggest that giving AI systems verifiable intermediate feedback can make them more reliable, especially for tasks where the final answer alone is not enough to tell whether the reasoning was correct.