TGPO: Efficient Policy Optimization through Sequence Anchor and Information Gating
Abstract
Reinforcement learning from verifiable rewards (RLVR) has become an important paradigm for enhancing the reasoning capabilities of large language models, but it faces a persistent tradeoff between optimization stability and learning efficiency. Token-level importance weighting supports fine-grained credit assignment, yet it often introduces high variance and unstable parameter updates, whereas sequence-level optimization provides more stable learning dynamics at the cost of underexploiting informative local signals. We introduce Trust-Gated Policy Optimization (TGPO), an efficient policy optimization framework that integrates two complementary mechanisms: sequence anchors and information gates. TGPO anchors token-wise updates to a stable sequence-level reference, which reduces the influence of extreme local likelihood fluctuations on the gradient, while a trust-based information gate adaptively modulates the contribution of each token-level signal. By retaining and reweighting gradients from imperfect trajectories rather than discarding them, TGPO improves gradient utilization and sample efficiency while maintaining stable optimization behavior. Empirical results across seven mathematical reasoning datasets and multiple model scales show that TGPO consistently improves learning efficiency and overall performance in outcome-supervised reinforcement learning settings.
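To make the anchor-and-gate idea concrete, the following is a minimal PyTorch sketch of how the two mechanisms described above could compose. The abstract does not specify the exact formulation, so the geometric-mean sequence anchor, the deviation-based trust gate, and the temperature hyperparameter tau are all illustrative assumptions rather than the paper's definitive method.

```python
import torch

def tgpo_token_weights(logp_new, logp_old, mask, tau=0.5):
    """Hypothetical sketch of TGPO-style anchor-and-gate reweighting.

    logp_new, logp_old: (B, T) per-token log-probs under the current
        and behavior policies.
    mask: (B, T) float tensor, 1 for response tokens, 0 for padding.
    tau: assumed gate temperature (not specified in the abstract).
    """
    # Token-level log importance ratios.
    log_r = (logp_new - logp_old) * mask  # (B, T)

    # Sequence anchor: length-normalized (geometric-mean) sequence-level
    # log ratio, one scalar reference per trajectory.
    seq_len = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    log_anchor = log_r.sum(dim=1, keepdim=True) / seq_len  # (B, 1)

    # Trust gate in (0, 1]: tokens whose ratio deviates strongly from
    # the anchor are trusted less.
    deviation = (log_r - log_anchor).abs()
    gate = torch.exp(-deviation / tau)

    # Gated ratios: interpolate each token's log ratio toward the anchor
    # instead of clipping or dropping it, so every token still
    # contributes a (down-weighted) gradient.
    log_r_gated = gate * log_r + (1.0 - gate) * log_anchor
    return torch.exp(log_r_gated) * mask
```

In a GRPO-style outcome-supervised loop, these gated ratios would multiply group-normalized advantages before averaging over response tokens. The key property of this sketch is that off-anchor tokens are shrunk toward the sequence-level update rather than clipped away, which matches the abstract's claim of retaining and reweighting gradients from imperfect trajectories rather than excluding them.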