Poster Tue, Jul 7, 2026 • 10:30 PM – 12:15 AM PDT HALL A #1715

Simple Policy Gradients for Reasoning with Diffusion Language Models

Anthony Zhan

Abstract

Diffusion large language models (dLLMs) represent a promising alternative to autoregressive LLMs; however, the lack of effective post-training techniques, including reinforcement learning (RL), remains a key challenge for dLLMs, especially for downstream applications. Existing approaches often rely on a sequence-level view that requires biased likelihood approximations. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the Markovian nature of dLLM generation, optimizing individual denoising steps rather than full sequences. Our approach improves the theoretical alignment between training and inference policies and also admits efficient, unbiased gradient updates via a novel timestep estimation scheme. We demonstrate AGRPO's effectiveness on different math and reasoning tasks, achieving absolute accuracy gains of +59.4% and +69.7% on Countdown and Sudoku, respectively, over the base LLaDA model, exceeding comparable methods such as diffu-GRPO.
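To make two of the ideas named in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the AGRPO implementation, and the abstract does not specify the timestep estimation scheme. The sketch only illustrates (a) group-relative advantages in the usual GRPO style, where each reward is normalized against the other samples in its group, and (b) one generic way a sum over denoising steps could be estimated without bias from a few uniformly sampled steps. The function names `group_relative_advantages` and `sampled_step_estimate`, the `eps` constant, and the uniform sampling choice are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group (hypothetical helper).

    Subtracts the group mean and divides by the group standard deviation,
    so samples that beat their group average get positive advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sampled_step_estimate(per_step_terms, num_samples, rng):
    """Estimate sum(per_step_terms) from uniformly sampled steps (assumed scheme).

    Each draw picks one step uniformly at random; scaling the sample mean by
    the number of steps gives an estimator whose expectation equals the full sum.
    """
    total_steps = len(per_step_terms)
    draws = [per_step_terms[rng.randrange(total_steps)] for _ in range(num_samples)]
    return total_steps * mean(draws)


if __name__ == "__main__":
    rng = random.Random(0)
    # Four sampled completions for one prompt, with binary task rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Stand-in per-step objective terms for a 4-step denoising trajectory.
    print(sampled_step_estimate([1.0, 2.0, 3.0, 4.0], num_samples=2, rng=rng))
```

In a real training loop the per-step terms would be advantage-weighted log-probabilities of individual denoising steps, and sampling only a few steps per trajectory would trade extra variance for less compute; both of those details are assumptions here rather than statements about the paper.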

Lay Summary

Standard large language models learn to predict the next token in a left-to-right manner. Diffusion large language models (dLLMs) instead learn to predict multiple tokens in parallel, with exciting possible applications in coding, reasoning, and other areas. However, adapting modern techniques like reinforcement learning (RL) to these models is not straightforward. In this paper, we develop a new way of training dLLMs with RL. We design our algorithm around the multi-step process that dLLMs use to generate text, showing how this mitigates previous issues while still being easy to compute. Our approach leads to significant performance gains on multiple reasoning tasks, opening up new perspectives on how to efficiently train dLLMs.