HiPER: Hierarchical Plan–Execute RL for Multi-Turn LLM Agents
Abstract
Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse, delayed rewards, where agents must execute long sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) methods model LLM agents as flat policies operating at a single time scale, selecting one action per turn. In sparse-reward settings, this forces the agent to infer long-range dependencies solely from distant end-of-trajectory signals, often leading to inefficient learning and unstable behavior in complex environments. We propose HiPER, a Hierarchical Plan–Execute RL framework that jointly models and optimizes high-level subgoal planning and low-level action execution, addressing flat RL's brittle long-horizon behavior and weak credit assignment under sparse outcome feedback. By maintaining persistent subgoals across multiple turns and explicitly deciding when to switch between them, HiPER introduces structured intermediate decision points that convert implicit multi-turn structure into learnable decisions at different time scales, easing learning under sparse feedback. To enable effective training, we introduce Hierarchical Advantage Estimation (HAE), a two-timescale policy gradient method that assigns credit to both action execution and subgoal transitions while reducing variance relative to flat advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4\% success on ALFWorld (+6.6\% over the best prior method) and 83.3\% on WebShop, with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.
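To make the notion of two-timescale credit assignment concrete, the sketch below applies a standard GAE-style estimator separately within subgoal segments (low level) and across subgoal transitions (high level). It is an illustrative sketch only, not HiPER's exact HAE formulation; the `Segment` structure, function names, and hyperparameters are hypothetical.

```python
# Illustrative two-timescale advantage estimation (a sketch, not the paper's HAE).
# Assumption: a trajectory is already split into subgoal segments; low-level
# advantages are computed per step inside each segment, high-level advantages
# per subgoal choice, both with generalized advantage estimation (GAE).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    rewards: List[float]    # per-step rewards while pursuing one subgoal
    values: List[float]     # low-level value estimates V(s_t) for those steps
    bootstrap_value: float  # value estimate of the state after the segment ends

def gae(rewards: List[float], values: List[float], bootstrap: float,
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Standard GAE over one reward/value sequence (backward recursion)."""
    advantages = [0.0] * len(rewards)
    next_value, running = bootstrap, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def hierarchical_advantages(segments: List[Segment], high_values: List[float],
                            gamma: float = 0.99, lam: float = 0.95
                            ) -> Tuple[List[List[float]], List[float]]:
    """Return (per-step low-level advantages, per-subgoal high-level advantages).

    Low level: GAE within each segment, bootstrapped at the segment boundary.
    High level: GAE over segment-level returns, treating each subgoal as one step.
    """
    low = [gae(seg.rewards, seg.values, seg.bootstrap_value, gamma, lam)
           for seg in segments]
    # Each subgoal "step" is credited with the discounted reward of its segment.
    seg_returns = [sum(gamma ** i * r for i, r in enumerate(seg.rewards))
                   for seg in segments]
    high = gae(seg_returns, high_values, segments[-1].bootstrap_value, gamma, lam)
    return low, high

if __name__ == "__main__":
    segs = [Segment([0.0, 0.0, 0.0], [0.1, 0.2, 0.3], 0.4),
            Segment([0.0, 1.0], [0.5, 0.7], 0.0)]
    low_adv, high_adv = hierarchical_advantages(segs, high_values=[0.3, 0.6])
    print(low_adv)   # advantages for the low-level (execution) policy
    print(high_adv)  # advantages for the high-level (planning) policy
```

Estimating advantages at the subgoal level over far fewer decision points is the intuition behind the variance reduction claimed for HAE relative to flat, per-token or per-turn advantage estimation.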