Langevin Rollout Optimization for Modelic Reinforcement Learning
Abstract
Planning-driven model-based (modelic) reinforcement learning has achieved strong results on continuous control tasks, but it relies predominantly on zero-order optimizers such as Model Predictive Path Integral (MPPI). While robust for global exploration, MPPI updates actions solely through sampling and ignores the smooth return gradients that structured dynamics provide and that could guide fine-grained search. To complement MPPI’s robustness with gradient-guided precision, we first propose \textbf{La}ngevin \textbf{R}ollout \textbf{O}ptimization (LaRO), which uses return gradients to refine actions via Langevin dynamics, achieving reliable local convergence without sacrificing multimodal exploration. LaRO is supported by a score-augmented world model that jointly learns the dynamics and a score function within a unified latent space, enabling efficient and accurate gradient estimation for real-time planning. Second, we combine MPPI and LaRO through a simple yet effective selection mechanism, termed \textbf{M}aximum \textbf{L}ook-\textbf{A}head \textbf{P}lanning (MLAP). Finally, we instantiate MLAP within the recent BOOM algorithm, replacing its MPPI-only planner and yielding BOOM-L. Empirical results on the DeepMind Control Suite and HumanoidBench demonstrate that BOOM-L consistently outperforms strong baselines in both sample efficiency and final performance.
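To make the idea of refining an action sequence with Langevin dynamics concrete, the following is a minimal sketch under strong assumptions, not the method's actual implementation. The quadratic `toy_return`, its analytic gradient, the `target` vector, the step size, and the temperature are all hypothetical stand-ins for the learned score-augmented world model. The final step, which keeps whichever candidate plan has the higher predicted return, is likewise only an assumed reading of how a sampling-based proposal and a gradient-refined proposal might be compared.

```python
import math
import random


def toy_return(actions, target):
    """Hypothetical smooth return: larger when actions are close to target."""
    return -sum((a - t) ** 2 for a, t in zip(actions, target))


def toy_return_grad(actions, target):
    """Analytic gradient of toy_return with respect to each action."""
    return [-2.0 * (a - t) for a, t in zip(actions, target)]


def langevin_refine(actions, target, steps=200, step_size=0.05,
                    temperature=1e-3, rng=None):
    """Unadjusted Langevin ascent on the return.

    Each step moves along the return gradient and adds Gaussian noise
    scaled by sqrt(2 * step_size * temperature).
    """
    rng = rng or random.Random(0)
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    a = list(actions)
    for _ in range(steps):
        g = toy_return_grad(a, target)
        a = [ai + step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
             for ai, gi in zip(a, g)]
    return a


def select_plan(candidates, target):
    """Assumed selection rule: keep the plan with the highest predicted return."""
    return max(candidates, key=lambda plan: toy_return(plan, target))


target = [0.5, -0.2, 0.1]      # hypothetical optimum of the toy return
initial = [0.0, 0.0, 0.0]      # stands in for a sampling-based proposal
refined = langevin_refine(initial, target)
chosen = select_plan([initial, refined], target)
```

In this toy setting the noise term keeps the iterate fluctuating around the optimum instead of collapsing onto it, which is the property that lets Langevin updates retain some exploration while still following the gradient.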