Noisy-Space Policy Gradient for Diffusion Policies in Offline Reinforcement Learning
Abstract
Diffusion policies offer a powerful and expressive parameterization for continuous control, yet their integration with reinforcement learning remains conceptually and algorithmically challenging. In this work, we address this gap by introducing a noisy-space action-value (Q) function that assigns values to diffusion latents through the distribution of executed actions induced by the denoising process. We show that this construction admits a precise semantic interpretation and derive a noisy-space policy gradient (NSPG) in which value estimates for noisy latents are computed using only clean action-space values. Building on this result, we formulate a KL-regularized policy improvement step over noisy latents and show that the resulting objective admits a diffusion-compatible regression form, avoiding backpropagation through the denoising process. Experiments on the D4RL benchmark demonstrate that semantically grounded value gradients provide a principled and effective foundation for training diffusion policies in offline reinforcement learning.
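To make the abstract's central construction concrete, the sketch below shows one way a noisy-space Q-value could be estimated purely from clean action-space values: run the remaining stochastic denoising steps from a latent several times, execute the resulting actions through a standard Q-network, and average. This is a minimal illustrative sketch, not the paper's implementation; the names `noisy_q_estimate`, `q_net`, `denoise_step`, and `n_samples` are hypothetical, and the Monte Carlo averaging is an assumption about how the expectation over executed actions might be approximated.

```python
import torch

@torch.no_grad()  # no backpropagation through the denoising process
def noisy_q_estimate(q_net, denoise_step, state, latent, t, n_samples=8):
    """Hypothetical Monte Carlo estimate of a noisy-space Q-value.

    Values a diffusion latent `latent` at denoising step `t` via the
    distribution of executed actions it induces: complete the reverse
    diffusion several times and average the clean action-space Q-values.
    """
    values = []
    for _ in range(n_samples):
        a = latent
        for k in range(t, 0, -1):          # remaining reverse-diffusion steps
            a = denoise_step(state, a, k)  # one stochastic denoising step
        values.append(q_net(state, a))     # clean action-space value only
    return torch.stack(values).mean(dim=0)
```

Note that only the clean Q-network is ever queried, consistent with the claim that value estimates for noisy latents are computed exclusively from clean action-space values.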