SRPO: Self-Reflective Policy Optimization for Long-Horizon Reasoning
Jialong Liu ⋅ Yuling Shi ⋅ Ning Yang ⋅ Xiaodong Gu ⋅ Zuchao Li
Abstract
Self-reflection is a powerful mechanism for credit assignment in human learning, converting sparse outcome feedback into actionable guidance. However, its potential for post-training Large Language Models (LLMs) remains underexplored. We propose Self-Reflective Policy Optimization (SRPO), a framework that internalizes this capability. SRPO enables LLMs to analyze their own completed trajectories, synthesize errors into concise "reflection patches," and use reflection-conditioned rollouts as high-quality, on-policy distillation targets. This process transforms sparse terminal supervision into dense, token-level learning signals without requiring external critics, separate reward models, or larger teacher models. We demonstrate that SRPO achieves state-of-the-art performance across mathematical reasoning and long-horizon agentic benchmarks with exceptional data efficiency. With a Qwen3-8B base model, SRPO attains 73.3\% on AIME’24 using only 8\% (0.08$\times$) of the training FLOPs required by scaled supervised fine-tuning, while significantly improving success rates on WebShop (64.7\%), ALFWorld (76.8\%), and SWE-Bench-Lite (31.2\%).
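The abstract outlines the SRPO loop at a high level: roll out, reflect on failures, re-roll conditioned on the reflection, and distill successful reflection-conditioned rollouts back into the policy. The sketch below is a minimal schematic of that loop, not the authors' implementation; the helpers rollout(), reflect(), distill_on_policy(), and the verifier argument are hypothetical placeholders for illustration.

```python
# Schematic sketch of the SRPO training step described above.
# All helper functions here are assumed/hypothetical, not the paper's code.

def srpo_step(policy, task, verifier):
    # 1. Roll out the current policy; the verifier returns only a sparse,
    #    terminal outcome signal (e.g., 1.0 for success, 0.0 otherwise).
    trajectory = rollout(policy, task)
    if verifier(task, trajectory) == 1.0:
        return  # nothing to correct on this task

    # 2. The same model analyzes its completed trajectory and synthesizes
    #    its errors into a concise "reflection patch".
    patch = reflect(policy, task, trajectory)

    # 3. Re-roll conditioned on the reflection patch; a successful
    #    reflection-conditioned rollout becomes the distillation target.
    improved = rollout(policy, task, context=patch)
    if verifier(task, improved) == 1.0:
        # 4. Distill the improved rollout back into the unconditioned
        #    policy, turning the sparse terminal reward into dense,
        #    token-level supervision on an on-policy trajectory.
        distill_on_policy(policy, task, target=improved)
```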