Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of the GRPO-based objective, we reveal that the rigid accept-or-discard decision inherent to hard clipping is the primary bottleneck. Specifically, we find that many high-value signals lie in the near-boundary region just beyond the clipping threshold and are thus discarded. Motivated by this diagnosis, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover the lost signals. Although NSR's stochastic sampling can be interpreted as inducing an implicit gradient decay in expectation, our ablations show that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. In extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.
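To make the mechanism concrete, the following is a minimal sketch of how NSR-style stochastic rescue might sit on top of a standard PPO/GRPO clipped surrogate. It is not the paper's exact formulation: the near-boundary band width `band`, the rescue probability `rescue_p`, and the function name are illustrative assumptions.

```python
import torch

def nsr_policy_loss(logp_new, logp_old, advantages,
                    clip_eps=0.2, band=0.1, rescue_p=0.5):
    """Sketch of Near-boundary Stochastic Rescue on a clipped surrogate.

    `band` and `rescue_p` are hypothetical hyperparameters, not values
    taken from the paper.
    """
    ratio = torch.exp(logp_new - logp_old)      # importance ratio r_t
    lo, hi = 1.0 - clip_eps, 1.0 + clip_eps

    # Standard PPO/GRPO-style clipped objective: tokens whose ratio falls
    # outside [lo, hi] contribute no gradient on the clipped side.
    clipped = torch.clamp(ratio, lo, hi)
    surrogate = torch.min(ratio * advantages, clipped * advantages)

    # Near-boundary region: ratios just beyond the clip threshold, within
    # a small band. Hard clipping discards these signals outright.
    near_boundary = ((ratio > hi) & (ratio <= hi + band)) | \
                    ((ratio < lo) & (ratio >= lo - band))

    # Stochastically rescue a fraction of near-boundary tokens by keeping
    # the unclipped term, so their gradient survives; in expectation this
    # acts like a decayed (scaled-down) gradient for that region.
    rescue = near_boundary & (torch.rand_like(ratio) < rescue_p)
    surrogate = torch.where(rescue, ratio * advantages, surrogate)

    return -surrogate.mean()
```

Because the rescue is sampled per token per step, the expected gradient on the near-boundary band equals the unclipped gradient scaled by `rescue_p`, which is the "implicit gradient decay in expectation" noted above; the abstract's ablation claim is that realizing this decay stochastically outperforms applying the same scaling deterministically.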