Generative Online Reinforcement Learning
Chubin Zhang ⋅ Zhenglin Wan ⋅ Feng Chen ⋅ Fuchao Yang ⋅ Lang Feng ⋅ Yaxin Zhou ⋅ Xingrui Yu ⋅ Yang You ⋅ Ivor Tsang ⋅ Bo An
Abstract
Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize (e.g., Gaussians) are often too simple to represent the multimodal action distributions required for complex control. Conversely, expressive generative policies—such as diffusion and flow matching—are frequently unstable in online RL due to intractable likelihoods and gradients propagating through long sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this, we introduce GoRL (Generative Online Reinforcement Learning), an algorithm-agnostic framework that trains expressive policies from scratch by confining policy optimization to a tractable latent space while delegating action synthesis to a conditional generative decoder. Using a two-timescale alternating schedule and anchoring decoder refinement to a fixed prior, GoRL enables stable optimization while continuously expanding expressiveness. Empirically, GoRL consistently outperforms unimodal and generative baselines across diverse continuous-control tasks. Notably, on the challenging HopperStand task, it achieves episodic returns exceeding 870—more than $3\times$ that of the strongest baseline—demonstrating a practical path to policies that are both stable and highly expressive.
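To make the decoupling concrete, below is a minimal sketch of the architecture the abstract describes: a tractable latent policy optimized in latent space, with action synthesis delegated to a conditional generative decoder. This is an illustrative assumption, not the paper's implementation; the class names (`LatentPolicy`, `GenerativeDecoder`), the MLP decoder standing in for an expressive model such as a flow-matching generator, and the `act` helper are all hypothetical.

```python
import torch
import torch.nn as nn

class LatentPolicy(nn.Module):
    """Tractable policy over a low-dimensional latent space (a Gaussian here)."""
    def __init__(self, obs_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-std of the latent
        )

    def forward(self, obs):
        mu, log_std = self.net(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

class GenerativeDecoder(nn.Module):
    """Conditional generator mapping (obs, latent) to an action.
    A plain MLP stands in for the expressive generative decoder
    (e.g., diffusion or flow matching) described in the abstract."""
    def __init__(self, obs_dim, latent_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, z):
        return torch.tanh(self.net(torch.cat([obs, z], dim=-1)))

def act(policy, decoder, obs):
    """Hypothetical action-selection helper illustrating the decoupling:
    the latent policy supplies a tractable log-probability for optimization,
    while the decoder (trained on its own timescale, no gradients here)
    supplies expressive action synthesis."""
    dist = policy(obs)
    z = dist.sample()          # tractable latent sample
    with torch.no_grad():      # policy-gradient updates never flow through generation
        action = decoder(obs, z)
    return action, dist.log_prob(z).sum(-1)
```

Under this decoupling, policy-gradient updates touch only the Gaussian over `z`, so optimization stays as stable as for a unimodal policy, while the decoder (refined on the slower timescale against a fixed prior) carries the multimodal expressiveness.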