GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Hongze Tan ⋅ Zihan Wang ⋅ Jianfei Pan ⋅ Jinghao Lin ⋅ Hao Wang ⋅ Yifan Wu ⋅ Tao Chen ⋅ Zhihang Zheng ⋅ Tang ⋅ Haihua Yang
Abstract
Reinforcement Learning (RL) is pivotal for enhancing Large Language Model (LLM) reasoning, yet mainstream algorithms such as GRPO and DAPO remain constrained by a coarse-grained credit-assignment paradigm in which every token of a response receives the same reward. In this paper, we propose **Dynamic Entropy Weighting**, which systematically defines entropy-based weight ratios $\frac{H_{i,t}}{\sum_{k=1}^{n} H_{k,t}}$ and similar variants to redistribute rewards, yielding fine-grained reward signals through two new algorithms: **Group Token Policy Optimization (GTPO)**, which assigns an entropy-weighted reward to each token and synthesizes a token-specific advantage function that drives the model toward optimal reasoning paths, and **Sequence-Level GRPO (GRPO-S)**, which applies the same design at the sequence level. Unlike methods that use entropy merely as a regularizer, GTPO and GRPO-S use it directly for reward shaping; they establish a new state of the art on AIME and MATH-500, outperforming prior entropy-guided baselines and validating our weighting mechanism.
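To make the weight ratio above concrete, the PyTorch sketch below redistributes each scalar sequence reward across its tokens in proportion to per-token policy entropy, reading $\sum_{k=1}^{n} H_{k,t}$ as a sum over the $n$ group members at each position $t$. This is a minimal illustrative reading, not the paper's implementation: the function name, the equal-length assumption, and the group-mean renormalization are our own choices.

```python
import torch

def entropy_weighted_rewards(logits: torch.Tensor,
                             rewards: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical sketch of Dynamic Entropy Weighting.

    Args:
        logits:  (n, T, V) logits for a group of n sampled responses,
                 each padded/truncated to T tokens over a V-sized vocab.
        rewards: (n,) scalar sequence-level rewards (e.g., from a verifier).

    Returns:
        (n, T) token-level rewards, where each sequence's reward is
        weighted by H_{i,t} / sum_k H_{k,t} across the group at position t.
    """
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1)           # (n, T): H_{i,t}

    # Normalize each token's entropy by the group total at that position,
    # then rescale by n so the group-mean weight stays near 1 and the
    # overall reward magnitude is comparable to the unshaped case.
    group_total = entropy.sum(dim=0, keepdim=True) + eps  # (1, T)
    weights = entropy / group_total * entropy.shape[0]    # (n, T)

    return rewards[:, None] * weights                     # (n, T)
```

Under this reading, high-entropy tokens (where the policy is uncertain, e.g., at reasoning branch points) absorb a larger share of the sequence reward, which is the fine-grained credit assignment the abstract describes; a GRPO-style advantage would then be computed from these token rewards rather than from a single shared sequence reward.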