dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models
Abstract
Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation and thus pose new challenges for aligning dLLMs with human preferences. In this work, we aim to optimize the dLLM generation process by developing a theoretical formulation and an efficient and effective quantification of the probability of the generation trajectory. We prove that (i) under reference policy regularization, the probability ratio of intermediate diffusion states equals that of the newly unmasked tokens, and (ii) the probability of the entire generation can be estimated with a single forward pass using block attention. Integrating these two estimates into a preference optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks and show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6\% on STEM tasks, up to 4.3\% on coding tasks, and up to 3.0\% on instruction-following tasks.
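As a rough illustration of claim (i), in notation that is ours rather than the paper's: write $x_t$ and $x_{t-1}$ for consecutive intermediate states of the reverse (unmasking) process and $\mathcal{U}_t$ for the set of positions newly unmasked at step $t$. Assuming the per-step transition factorizes over positions, as in standard masked diffusion language models, the positions that remain masked or unchanged contribute identically under the policy $\pi_\theta$ and the reference $\pi_{\mathrm{ref}}$ and cancel, so the state-level probability ratio reduces to a product over the newly unmasked tokens:
\[
\frac{\pi_\theta(x_{t-1}\mid x_t)}{\pi_{\mathrm{ref}}(x_{t-1}\mid x_t)}
\;=\;
\prod_{i\in\mathcal{U}_t}
\frac{\pi_\theta\!\big(x_{t-1}^{(i)}\mid x_t\big)}{\pi_{\mathrm{ref}}\!\big(x_{t-1}^{(i)}\mid x_t\big)}.
\]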