Stabilizing Reinforcement Learning for Diffusion Language Models
Abstract
Diffusion Large Language Models (dLLMs) often exhibit severe instability during Group Relative Policy Optimization (GRPO) training, limiting the effectiveness of reinforcement learning for improving their reasoning capabilities. In dLLMs, the importance ratios used by GRPO are derived from finite-sample estimates rather than exact likelihoods, making them inherently noisy. In this paper, we show that GRPO is highly sensitive to this noise, which drives training instability. Through theoretical analysis and empirical evidence, we identify a self-reinforcing instability loop: noisy importance ratios induce gradient spikes and policy drift, which in turn amplify the variance of future importance-ratio estimates. To address this issue, we propose StableDRL, a novel reinforcement learning framework for dLLMs. StableDRL stabilizes training via (i) unconditional clipping, which suppresses outlier-induced gradient spikes, and (ii) self-normalization, which constrains the aggregated gradient to the convex hull of per-sample updates. We further extend StableDRL to block-wise diffusion models via a staircase attention mechanism. StableDRL is the first method to enable stable, full-parameter reinforcement learning for dLLMs. It achieves state-of-the-art performance, outperforming the best prior full-attention baseline by 6% on MATH500 and the best block-diffusion baseline by 25.6% on AIME.
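As a rough illustration of how the two stabilizers named above could compose, the sketch below combines unconditional ratio clipping with self-normalized importance weights in a single surrogate loss for one group of samples. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name, tensor shapes, and the 0.2 clipping range are illustrative, and in practice the log-probabilities for dLLMs would themselves be finite-sample estimates.

```python
# Minimal sketch (illustrative, not the authors' implementation) of
# unconditional clipping + self-normalization for one group of B samples.
import torch

def stabilized_grpo_loss(logp_new: torch.Tensor,    # [B] log-prob estimates under the current policy
                         logp_old: torch.Tensor,    # [B] log-prob estimates under the behavior policy
                         advantages: torch.Tensor,  # [B] group-relative advantages
                         clip_eps: float = 0.2) -> torch.Tensor:
    # Finite-sample importance-ratio estimates; for dLLMs these are noisy.
    ratios = torch.exp(logp_new - logp_old).detach()

    # (i) Unconditional clipping: every ratio is bounded regardless of the
    # advantage sign, so a single outlier estimate cannot spike the gradient.
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps)

    # (ii) Self-normalization: weights are non-negative and sum to 1, so the
    # aggregated gradient is a convex combination of the per-sample updates
    # -A_i * grad(logp_new_i).
    weights = clipped / clipped.sum().clamp_min(1e-8)

    # Importance-weighted policy-gradient surrogate (minimized by the optimizer).
    return -(weights * advantages * logp_new).sum()
```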