Trust Region Masking for Long-Horizon LLM Reinforcement Learning
Yingru Li ⋅ Jiacai Liu ⋅ Jiawei Xu ⋅ Yuxuan Tong ⋅ Ziniu Li ⋅ Baoxiang Wang
Abstract
Policy gradient methods for Large Language Models (LLMs) optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences, such as backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness. These factors cause an off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$), introducing approximation error between the surrogate and true objectives and often precipitating training collapse. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ in the sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive two tighter bounds: a *Pinsker-Marginal* bound scaling as $O(T^{3/2})$ and a *Mixed* bound scaling as $O(T)$. Crucially, both bounds depend on $\mathcal{D}_{\text{KL}}^{\max}$, the maximum token-level KL divergence across the sequence. Because this is a *sequence-level* quantity, it cannot be controlled by methods that act on each token independently, such as PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences that violate the trust region. Theoretically, TRM provides the first non-vacuous monotonic improvement guarantees; empirically, it improves training stability for long-horizon LLM-RL.
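The masking rule described above lends itself to a short sketch. The following is a minimal PyTorch illustration, not the paper's implementation: it estimates the per-token KL from sampled log-probabilities using the k3 estimator (the exact $\mathcal{D}_{\text{KL}}^{\max}$ would require full vocabulary distributions), takes the maximum over each sequence, and masks out sequences whose maximum exceeds a trust region radius. The function name `trm_sequence_mask`, the threshold value, and the KL direction are illustrative assumptions.

```python
# Minimal sketch of sequence-level trust region masking (illustrative, not the
# paper's reference implementation). Assumes per-token log-probs of the sampled
# tokens are available from both the rollout policy and the current policy.
import torch


def trm_sequence_mask(
    logp_theta: torch.Tensor,    # (B, T) log pi_theta(a_t | s_t) for sampled tokens
    logp_roll: torch.Tensor,     # (B, T) log pi_roll(a_t | s_t) for sampled tokens
    token_mask: torch.Tensor,    # (B, T) 1.0 for real tokens, 0.0 for padding
    kl_threshold: float = 0.05,  # trust region radius (hypothetical value)
) -> torch.Tensor:
    """Return a (B,) mask that is 1.0 for sequences inside the trust region."""
    log_ratio = logp_theta - logp_roll
    ratio = log_ratio.exp()
    # k3 sample estimator of the per-token KL(pi_roll || pi_theta); non-negative.
    # The exact token-level KL would need the full vocabulary distributions.
    kl_tok = ratio - 1.0 - log_ratio
    # Zero out padded positions, then take the per-sequence maximum:
    # a proxy for the D_KL^max quantity the bounds depend on.
    kl_tok = kl_tok.masked_fill(token_mask == 0, 0.0)
    kl_max = kl_tok.max(dim=-1).values            # (B,)
    return (kl_max <= kl_threshold).float()       # (B,)


# Usage: zero out the surrogate loss of sequences that violate the trust region.
if __name__ == "__main__":
    B, T = 4, 16
    logp_roll = torch.randn(B, T).clamp(max=0)
    logp_theta = logp_roll + 0.01 * torch.randn(B, T)
    token_mask = torch.ones(B, T)
    seq_mask = trm_sequence_mask(logp_theta, logp_roll, token_mask)
    per_seq_loss = torch.randn(B)                 # placeholder surrogate loss
    loss = (seq_mask * per_seq_loss).sum() / seq_mask.sum().clamp(min=1.0)
```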