Poster Mon, Jul 6, 2026 • 6:30 PM – 8:15 PM PDT HALL A #1704

Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

Zichao Li ⋅ Jie Lou ⋅ Fangchen Dong ⋅ Zhiyuan Fan ⋅ Mengjie Ren ⋅ Hongyu Lin ⋅ Xianpei Han ⋅ Debing Zhang ⋅ Le Sun ⋅ Yaojie Lu ⋅ XingYu

Abstract

Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
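To make the compensatory effect of additive penalties concrete, here is a minimal toy sketch. It is an illustration under stated assumptions, not the GR$^3$ formulation: the abstract does not give the exact rescaling function, so `additive_reward`, `multiplicative_reward`, the `1 / (1 + scale * length)` factor, and the coefficient values are all hypothetical. Rewards are assumed to be non-negative.

```python
# Toy comparison of additive length penalties and multiplicative rescaling.
# All function names, the rescaling form, and the coefficients are
# illustrative assumptions; they are not taken from the paper.

def additive_reward(reward: float, length: int, penalty_per_token: float = 0.001) -> float:
    """Subtract a length cost that does not depend on answer quality."""
    return reward - penalty_per_token * length


def multiplicative_reward(reward: float, length: int, scale: float = 0.001) -> float:
    """Scale the reward by a length-dependent factor in (0, 1].

    Because the factor multiplies the reward, a zero-reward answer stays at
    zero however short it is, so brevity cannot compensate for low quality.
    """
    return reward / (1.0 + scale * length)


# A short wrong answer (reward 0) versus a long correct answer (reward 1).
short_wrong = (0.0, 10)
long_correct = (1.0, 2000)

# Additive: the short wrong answer scores higher than the long correct one.
additive_prefers_wrong = additive_reward(*short_wrong) > additive_reward(*long_correct)

# Multiplicative: the correct answer still scores higher than the wrong one.
multiplicative_prefers_correct = (
    multiplicative_reward(*long_correct) > multiplicative_reward(*short_wrong)
)
```

In this toy setting the additive penalty lets a short, wrong answer outrank a long, correct one, which is one way to read the "optimization shortcut" described above. The multiplicative form acts as a reward-dependent gate: the length factor only matters in proportion to the reward it scales.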

Lay Summary

Reinforcement learning has become a key technique for improving large language models, but it can also make them unnecessarily verbose. In many cases, models learn to produce longer answers or reasoning traces not because they are truly better, but because longer outputs can receive higher rewards or slightly increase the chance of success. This length inflation raises inference costs and makes model outputs less efficient. This paper introduces Group Relative Reward Rescaling, or GR³, a method that encourages models to be concise without weakening their ability to solve tasks. Instead of subtracting a fixed penalty for length, GR³ adjusts the reward according to how long a model’s answer is compared with other answers to the same problem. This allows the method to give harder problems more room for reasoning while discouraging unnecessary verbosity on easier ones. Across reasoning, coding, and human-feedback-based training settings, GR³ reduces output length while maintaining or improving model performance. These results suggest that stronger language models do not necessarily need to think longer; with better training signals, they can learn to reason more efficiently.
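The idea of comparing an answer's length with other answers to the same problem can be sketched as follows. This is a rough illustration under assumptions, not the authors' implementation: the function `group_relative_rescale`, the choice of the group mean as the length reference, and the parameter `alpha` are hypothetical, and the advantage-aware calibration mentioned in the abstract is not modeled here.

```python
# Hypothetical sketch: rescale each reward by its length relative to the
# mean length of the group of answers sampled for the same problem.
# The formula and the alpha parameter are assumptions for illustration.
from statistics import mean
from typing import List, Sequence


def group_relative_rescale(
    rewards: Sequence[float], lengths: Sequence[int], alpha: float = 0.5
) -> List[float]:
    """Return rewards scaled down according to relative (not absolute) length."""
    avg_len = mean(lengths)  # the length reference adapts to each problem
    return [r / (1.0 + alpha * (n / avg_len)) for r, n in zip(rewards, lengths)]


# An "easy" problem with short answers and a "hard" problem with answers
# ten times longer, both with equally good rewards.
easy = group_relative_rescale([1.0, 1.0], [100, 200])
hard = group_relative_rescale([1.0, 1.0], [1000, 2000])
```

Because only the ratio to the group mean enters the factor, the hard problem's answers are not penalized for being ten times longer in absolute terms, while within each group the shorter of two equally rewarded answers still receives the larger rescaled reward.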