Distillation Models are Good Samplers for Diffusion Reinforcement Learning
Abstract
We present DMSampler, a framework that accelerates diffusion reinforcement learning by using fast distillation models as its training-time sampling engine. It overcomes the key bottleneck of diffusion RL, namely sampling from the policy model, which typically requires around 50 denoising steps, by employing a co-evolving distilled sampler that needs only 4–8 steps, yielding an order-of-magnitude speedup. This approach offers several inherent advantages: it drastically reduces the number of sampling steps, operates without classifier-free guidance and thus avoids a potential source of optimization bias, and often yields superior sample quality owing to more deterministic denoising trajectories. The core of DMSampler is a dual iterative training scheme in which the policy model and the distillation sampler are alternately optimized to convergence. This scheme is strengthened by two key innovations: hybrid distillation sampling, which blends outputs from both models to stabilize training, and reward-aware distillation, which explicitly preserves high-reward capabilities during knowledge transfer. Extensive experiments on text-to-image and text-to-video generation demonstrate that the final policy model produced by DMSampler achieves state-of-the-art performance, significantly boosting textual accuracy on OCR-specific benchmarks and outperforming existing diffusion RL methods on the comprehensive GenEval and VBench benchmarks.
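To make the dual iterative scheme concrete, the toy sketch below alternates the two phases the abstract describes: the policy is updated on rollouts drawn by a hybrid of the fast sampler and the policy itself (hybrid distillation sampling), and the sampler is then re-distilled from the updated policy with reward-based sample weighting (reward-aware distillation). Everything here, including the `Denoiser` module, `toy_reward`, the reward-weighted objectives, and all hyperparameters, is an illustrative assumption rather than the paper's actual models or losses.

```python
# Toy sketch of DMSampler's dual iterative training scheme (all details assumed).
import torch
import torch.nn as nn

DIM, POLICY_STEPS, FAST_STEPS = 16, 50, 8  # ~50-step policy vs. a 4-8-step sampler

class Denoiser(nn.Module):
    """Tiny epsilon-prediction MLP standing in for a diffusion backbone."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM + 1, 64), nn.SiLU(), nn.Linear(64, DIM))

    def forward(self, x, t):
        tt = torch.full((x.shape[0], 1), float(t))  # broadcast the timestep as a feature
        return self.net(torch.cat([x, tt], dim=-1))

def denoise(models_per_step, x):
    """Crude Euler-style denoising loop; models_per_step chooses which network
    takes each step, which is how this sketch realizes 'hybrid' sampling."""
    n = len(models_per_step)
    for i, model in enumerate(models_per_step):
        x = x - model(x, 1.0 - i / n) / n
    return x

def hybrid_models(policy, sampler, steps, mix_ratio=0.5):
    """Per step, pick the fast sampler with probability mix_ratio, else the policy."""
    return [sampler if torch.rand(()) < mix_ratio else policy for _ in range(steps)]

def toy_reward(x):
    """Placeholder reward preferring small-norm samples; stands in for a learned
    scorer such as an OCR-accuracy or aesthetic reward model."""
    return -x.pow(2).mean(dim=-1)

policy, sampler = Denoiser(), Denoiser()
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_s = torch.optim.Adam(sampler.parameters(), lr=1e-3)

for outer_round in range(3):  # alternate the two phases, as in the dual scheme
    # Phase 1: RL-style update of the policy on rollouts from the hybrid sampler.
    for _ in range(10):
        noise = torch.randn(32, DIM)
        with torch.no_grad():
            rollout = denoise(hybrid_models(policy, sampler, FAST_STEPS), noise)
            r = toy_reward(rollout)
            adv = (r - r.mean()).clamp(min=0)  # keep only above-average samples
        # Reward-weighted regression toward high-reward rollouts: a simple
        # stand-in for the (unspecified) policy-optimization objective.
        pred = denoise([policy] * FAST_STEPS, noise)
        loss_p = (adv * (pred - rollout).pow(2).mean(dim=-1)).mean()
        opt_p.zero_grad()
        loss_p.backward()
        opt_p.step()

    # Phase 2: reward-aware distillation of the updated policy into the sampler.
    for _ in range(10):
        noise = torch.randn(32, DIM)
        with torch.no_grad():
            target = denoise([policy] * POLICY_STEPS, noise)  # slow teacher rollout
            w = torch.softmax(toy_reward(target), dim=0)  # upweight high-reward samples
        student = denoise([sampler] * FAST_STEPS, noise)
        loss_s = (w * (student - target).pow(2).mean(dim=-1)).sum()
        opt_s.zero_grad()
        loss_s.backward()
        opt_s.step()
```

Note the asymmetry that motivates the method: the expensive 50-step rollout appears only inside the distillation phase, while all RL rollouts use the 4–8-step hybrid sampler, which is where the claimed order-of-magnitude training speedup would come from.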