Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
Abstract
Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi-step problems, yet their rollouts are structurally heterogeneous: variations in the number, placement, and outcomes of tool calls induce distinct behaviors and reward distributions. As a result, policy gradient methods with a single global baseline suffer from cross-stratum bias, an "apples-to-oranges" comparison that distorts credit assignment and impedes exploration. To address this issue, we propose Stratified GRPO. Its core component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on structural properties and computes advantages locally within each stratum, ensuring that trajectories are compared only with their true peers. We show that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates within strata, and preserves the global unbiasedness and unit-variance properties of standard normalization, resulting in a more reliable learning signal. To improve robustness in finite-sample regimes, we further linearly blend SAN with the global estimator. Experiments on factual QA and deep-research agent benchmarks demonstrate that Stratified GRPO consistently outperforms GRPO by up to 12.6 points, achieving higher training rewards, improved training stability, and more effective search policies. These results establish stratification as a principled remedy for structural heterogeneity in RL for LLM search agents.
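
To make the abstract's description concrete, the following minimal sketch (ours, not the paper's released code) shows how GRPO-style advantages could be stratified: rewards are normalized within each stratum and then linearly blended with the globally normalized estimate. The stratum key (number of tool calls), the blend weight `alpha`, and the function name are illustrative assumptions, not definitions from the paper.

import numpy as np

def stratified_advantages(rewards, strata, alpha=1.0, eps=1e-8):
    """Sketch of Stratified Advantage Normalization (SAN) with global blending.

    rewards: scalar rollout rewards within one GRPO group.
    strata:  hashable stratum keys per rollout (e.g., number of tool calls).
    alpha:   blend weight (alpha=1 -> pure SAN, alpha=0 -> global normalization).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    strata = np.asarray(strata)

    # Standard global normalization, as in vanilla GRPO.
    global_adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Stratified normalization: compare each trajectory only to structural peers.
    san_adv = np.empty_like(rewards)
    for key in np.unique(strata):
        mask = strata == key
        group = rewards[mask]
        san_adv[mask] = (group - group.mean()) / (group.std() + eps)

    # Linear blend for robustness when strata contain few samples.
    return alpha * san_adv + (1.0 - alpha) * global_adv

if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
    num_tool_calls = [1, 1, 1, 3, 3, 3]  # assumed stratum key: search-call count
    print(stratified_advantages(rewards, num_tool_calls, alpha=0.7))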