G$^2$RPO: Geometric GRPO; Escaping LLMs' Reasoning Rut to Break the Accuracy--Entropy Trade-off
Ali Rad ⋅ Khashayar Filom ⋅ Darioush Keivan ⋅ Peyman Mohajerin Esfahani ⋅ Ehsan Kamalinejad
Abstract
Reinforcement learning with verifiable rewards (RLVR) is a cornerstone of post-training for large reasoning models, yet widely used algorithms such as Group Relative Policy Optimization (GRPO) often exhibit \textbf{diversity collapse}. We provide a geometric diagnosis by formalizing GRPO as a dynamical flow on the probability simplex. Under a mode-based coarse-graining of rollouts, we show that GRPO induces a \textbf{collision field} over correct modes that monotonically pushes probability mass toward simplex vertices, yielding a \textbf{winner-take-all} regime. To address this systematically, we introduce \textbf{G$^2$RPO (Geometric GRPO)}, which reshapes RLVR via principled \textbf{vector-field editing}. Concretely, we intervene at the advantage level by adding granularity bonuses inversely proportional to mode probabilities, thereby encouraging underrepresented correct modes. The bonus has a natural geometric interpretation, and its potential side effects on accuracy can be mitigated, avoiding the usual accuracy--diversity trade-off. In experiments with 7B and 14B models trained on a math reasoning task and evaluated on \textbf{AIME 2024/2025}, GRPO loses up to \textbf{57\%} of active correct modes. In contrast, G$^2$RPO increases active correct-mode coverage by \textbf{172\%--205\%}, reduces concentration on any single correct mode, prevents the late-stage \emph{entropy crash}, and improves \texttt{pass@1} by \textbf{+1.4} to \textbf{+7.9} points relative to GRPO. Overall, diversity is not merely a regularizer but a \textbf{geometric property} that can be controlled to improve the model without trapping it in a single dominant strategy.
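To make the advantage-level intervention concrete, here is a minimal sketch of how such a granularity bonus could be applied to one rollout group. The function name `g2rpo_advantages`, the `mode_ids` labeling, the bonus scale `beta`, and the exact bonus form `beta / p_mode` are illustrative assumptions; the abstract specifies only that a bonus inversely proportional to mode probabilities is added on top of GRPO's group-relative advantages for correct modes.

```python
import numpy as np

def g2rpo_advantages(rewards, mode_ids, beta=0.1, eps=1e-8):
    """Illustrative sketch: GRPO advantages plus a granularity bonus.

    rewards  : verifiable 0/1 rewards for the G rollouts in one group.
    mode_ids : a mode label per rollout from some coarse-graining step
               (assumed given; not specified by the abstract).
    beta     : bonus scale, a hypothetical hyperparameter.
    """
    rewards = np.asarray(rewards, dtype=float)

    # Standard GRPO: group-relative, std-normalized advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Empirical probability of each mode within the rollout group.
    mode_ids = np.asarray(mode_ids)
    modes, counts = np.unique(mode_ids, return_counts=True)
    p_mode = dict(zip(modes.tolist(), counts / len(mode_ids)))

    # Granularity bonus, inversely proportional to mode probability and
    # applied only to correct rollouts: rare correct modes get the
    # largest push, counteracting the winner-take-all drift.
    bonus = np.array([beta / p_mode[m] if r > 0 else 0.0
                      for m, r in zip(mode_ids.tolist(), rewards)])
    return adv + bonus

# Example group: three correct rollouts spread over two modes plus one
# incorrect rollout; the lone correct mode "c" gets the largest bonus.
print(g2rpo_advantages([1, 1, 0, 1], ["a", "a", "b", "c"]))
```

In this sketch the bonus only perturbs correct rollouts, so the edit reshapes the flow among correct modes rather than rewarding incorrect ones, which is one way the accuracy side effects mentioned above could be kept in check.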