Self-Evolving LLM Agents under Offline Data Support
Abstract
Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in the sparsity and delay of supervision: agents often receive feedback only at episode termination, leading to severe credit-assignment problems. In this work, we propose a self-evolving framework for LLM agents that unifies automatic process-reward labeling and in-distribution policy learning within a principled offline reinforcement learning paradigm. Our method learns an in-distribution critic from a hybrid offline dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, providing dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent while remaining strictly within the support of the offline data, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on long-horizon interactive benchmarks, including ALFWorld, WebShop, and ScienceWorld, where it consistently outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable self-evolution of LLM agents is achievable by grounding both process-level supervision and policy improvement in a shared in-distribution learning loop.
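To make the abstract's central quantities concrete, the following is a minimal sketch of what a weighted Implicit Q-Learning critic and an advantage-based process reward could look like. The unweighted objectives follow standard IQL (Kostrikov et al., 2021); the per-sample weight $w(s,a)$, the expectile $\tau$, and the identification of the process reward with the advantage are illustrative assumptions, not the paper's verified formulation.

\[
\mathcal{L}_V(\psi) \;=\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\, w(s,a)\, L_2^{\tau}\!\big(Q_{\hat{\theta}}(s,a) - V_{\psi}(s)\big) \right],
\qquad
L_2^{\tau}(u) \;=\; \big|\tau - \mathbb{1}(u < 0)\big|\, u^2,
\]
\[
\mathcal{L}_Q(\theta) \;=\; \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\!\left[\big(r(s,a) + \gamma\, V_{\psi}(s') - Q_{\theta}(s,a)\big)^2\right],
\qquad
r^{\mathrm{proc}}(s_t, a_t) \;=\; A(s_t, a_t) \;=\; Q_{\theta}(s_t, a_t) - V_{\psi}(s_t).
\]

Because every expectation is taken over the offline dataset $\mathcal{D}$, the critic never evaluates actions outside the data; the expectile regression (with $\tau \to 1$ approximating an in-support maximum) is what keeps Bellman backups stable under sparse terminal rewards, and the resulting advantage yields a dense step-wise signal without environment backtracking.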