Poster in Workshop: Programmatic Representations for Agent Learning

DyPO: Dynamic Policy Optimization for Multi-Turn Interactive Reasoning

Xiao Feng ⋅ Bo Han ⋅ Zhanke Zhou ⋅ Jiaqi Fan ⋅ Jiangchao Yao ⋅ Ka Li ⋅ Dahai Yu ⋅ Michael Ng


Abstract

Existing on-policy reinforcement learning methods, such as Group Relative Policy Optimization (GRPO) and its variants, have enhanced the reasoning capabilities of large language models (LLMs). However, these methods often rely on static, pre-trained knowledge to navigate partially observed contexts, which limits their effectiveness in dynamic and evolving environments. In such settings, LLMs must actively interact with the environment to gather critical information, which calls for more adaptive reasoning strategies. To bridge this gap, we introduce $\textbf{Dy}$namic $\textbf{P}$olicy $\textbf{O}$ptimization (DyPO), which extends GRPO to multi-turn optimization in dynamic environments. In principle, DyPO guarantees a shift in reasoning pattern from static to dynamic multi-turn reasoning and stabilizes the training process when environmental information is involved. DyPO incorporates four key innovations: (1) distinct thinking and action tokens that integrate real-time environmental feedback during rollouts, (2) removal of divergence regularization to allow the transition to dynamic reasoning, (3) masked intermediate observations with simplified advantage estimation for enhanced stability, and (4) auxiliary resampling with rejection sampling to mitigate over-generation noise. These enhancements enable DyPO to achieve adaptive alignment with multi-turn interactive reasoning. Evaluations on two challenging simulated benchmarks, ALFWorld and WebShop, using two instantiations of DyPO with Qwen-2.5-3B-Instruct consistently demonstrate substantial improvements in both interactive decision-making and reasoning capabilities compared to existing approaches.
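
The abstract does not give the exact form of innovations (2) and (3), so the following is only a minimal sketch of how they could look, not the authors' implementation. It assumes that "simplified advantage estimation" means mean-centering the rewards within a rollout group without dividing by the standard deviation, and that "masked intermediate observations" means excluding environment-produced tokens from the loss. The function names `group_advantages` and `masked_pg_loss` are hypothetical, and the loss deliberately contains no divergence (KL) term.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages for one prompt's rollouts.

    ASSUMPTION: the 'simplified' estimator is taken to be reward minus
    the group mean, with no standard-deviation scaling.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def masked_pg_loss(
    token_logps: List[float],
    action_mask: List[int],
    advantage: float,
) -> float:
    """Policy-gradient loss for one multi-turn rollout.

    token_logps : log-probability of every token in the trajectory.
    action_mask : 1 for tokens the policy generated (thinking/action),
                  0 for environment observation tokens.
    advantage   : the rollout's scalar advantage, shared by its tokens.

    Observation tokens are masked out, and no KL penalty is added
    (an assumed reading of 'removal of divergence regularization').
    """
    n_action = sum(action_mask)
    if n_action == 0:
        return 0.0
    total = sum(lp * m for lp, m in zip(token_logps, action_mask))
    return -advantage * total / n_action
```

For a group of four rollouts with rewards `[1.0, 0.0, 0.0, 1.0]`, `group_advantages` returns `[0.5, -0.5, -0.5, 0.5]`. In `masked_pg_loss`, changing the log-probability of a token whose mask entry is 0 leaves the loss unchanged, which is the property the masking is meant to provide.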
