Self-Evolving LLM Agents under Offline Data Support
Abstract
Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in the sparsity and delay of supervision: agents often receive feedback only at episode termination, leading to severe credit-assignment problems. In this work, we propose a self-evolving framework for LLM agents that unifies automatic process-reward labeling and in-distribution policy learning within a principled offline reinforcement learning paradigm. Our method learns an in-distribution critic from a hybrid offline dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, providing dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent while remaining strictly within the support of the offline data, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on long-horizon interactive benchmarks, including ALFWorld, WebShop, and ScienceWorld, where it consistently outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable self-evolution of LLM agents is achievable by grounding both process-level supervision and policy improvement in a shared in-distribution learning loop.
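To make the abstract's central quantities concrete, the following is a minimal sketch of what a weighted Implicit Q-Learning critic and an advantage-based process reward could look like. The unweighted objectives follow standard IQL (Kostrikov et al., 2021); the per-sample weight $w(s,a)$, the expectile $\tau$, and the identification of the process reward with the advantage are illustrative assumptions, not the paper's verified formulation.

\[
\mathcal{L}_V(\psi) \;=\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\, w(s,a)\, L_2^{\tau}\!\big(Q_{\hat{\theta}}(s,a) - V_{\psi}(s)\big) \right],
\qquad
L_2^{\tau}(u) \;=\; \big|\tau - \mathbb{1}(u < 0)\big|\, u^2,
\]
\[
\mathcal{L}_Q(\theta) \;=\; \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\!\left[\big(r(s,a) + \gamma\, V_{\psi}(s') - Q_{\theta}(s,a)\big)^2\right],
\qquad
r^{\mathrm{proc}}(s_t, a_t) \;=\; A(s_t, a_t) \;=\; Q_{\theta}(s_t, a_t) - V_{\psi}(s_t).
\]

Because every expectation is taken over the offline dataset $\mathcal{D}$, the critic never evaluates actions outside the data; the expectile regression (with $\tau \to 1$ approximating an in-support maximum) is what keeps Bellman backups stable under sparse terminal rewards, and the resulting advantage yields a dense step-wise signal without environment backtracking.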