ARLArena: Demystifying Policy Gradient Stability in Agentic Reinforcement Learning
Xiaoxuan Wang ⋅ Han Zhang ⋅ Haixin Wang ⋅ Yidan Shi ⋅ Ruoyan Li ⋅ Kaiqiao Han ⋅ Chenyi Tong ⋅ Haoran Deng ⋅ Alexander Taylor ⋅ Renliang Sun ⋅ Yanqiao Zhu ⋅ Jason Cong ⋅ Yizhou Sun ⋅ Wei Wang
Abstract
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. In this paper, we first propose $\textbf{ARLArena}$, a fair and systematic analysis framework that encompasses a broad spectrum of ARL algorithms and decomposes policy optimization (PO) along multiple policy-gradient dimensions. Through this fine-grained analysis, we distill a unified perspective on ARL and, guided by the identified governing factors, propose $\textbf{SAMPO}$, a stable agentic PO method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy-gradient perspective on ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines. Our codebase is open-sourced at https://anonymous.4open.science/r/SAMPO-02B3.
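For orientation, the quantity that such dimension-wise analyses decompose is the policy-gradient estimator itself. The display below is a minimal standard form, not necessarily the paper's exact formulation: $\tau$ denotes a multi-step agent trajectory sampled from the policy $\pi_\theta$, $s_t$ and $a_t$ the state and action at step $t$, and $\hat{A}_t$ an advantage estimate; all of these symbols are illustrative assumptions rather than notation taken from the paper.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$

Axes along which variants of this estimator commonly differ include how $\hat{A}_t$ is estimated, the level at which gradients are aggregated (token, turn, or trajectory), and whether importance-ratio clipping is applied; these are offered only as examples of the kind of dimensions an analysis framework of this sort can compare.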