dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models
Abstract
Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation and thus pose new challenges for aligning dLLMs with human preferences. In this work, we aim to optimize the dLLM generation process by developing a theoretical formulation and an efficient and effective quantification of the probability of the generation trajectory. We prove that (i) under reference policy regularization, the probability ratio of intermediate diffusion states equals that of the newly unmasked tokens, and (ii) the probability of the entire generation can be estimated with a single forward pass using block attention. Integrating these two estimates into a preference optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks and show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6\% on STEM tasks, up to 4.3\% on coding tasks, and up to 3.0\% on instruction-following tasks.
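As a rough illustration of claim (i), in notation that is ours rather than the paper's: write $x_t$ and $x_{t-1}$ for consecutive intermediate states of the reverse (unmasking) process and $\mathcal{U}_t$ for the set of positions newly unmasked at step $t$. Assuming the per-step transition factorizes over positions, as in standard masked diffusion language models, the positions that remain masked or unchanged contribute identically under the policy $\pi_\theta$ and the reference $\pi_{\mathrm{ref}}$ and cancel, so the state-level probability ratio reduces to a product over the newly unmasked tokens:
\[
\frac{\pi_\theta(x_{t-1}\mid x_t)}{\pi_{\mathrm{ref}}(x_{t-1}\mid x_t)}
\;=\;
\prod_{i\in\mathcal{U}_t}
\frac{\pi_\theta\!\big(x_{t-1}^{(i)}\mid x_t\big)}{\pi_{\mathrm{ref}}\!\big(x_{t-1}^{(i)}\mid x_t\big)}.
\]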