UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
Jiaqi Wang ⋅ Haoge Deng ⋅ Ting Pan ⋅ Yang Liu ⋅ Chengyuan Wang ⋅ Fan Zhang ⋅ Yonggang Qi ⋅ Xinlong Wang
Abstract
Uniform Discrete Diffusion (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning (RL) remains largely unexplored. We observe that naively adapting Group Relative Policy Optimization (GRPO) to UDM leads to unstable training and marginal performance gains. To address this, we propose UDM-GRPO, the first framework that integrates UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample, rather than an intermediate predicted sample, as the action provides more accurate and stable optimization signals; and (ii) adopting the forward process to reconstruct the training trajectories helps the model learn probability paths that are more consistent with pretraining. For efficiency, we introduce Reduction-Step and CFG-Free training strategies. UDM-GRPO significantly improves the performance of the base model across multiple text-to-image (T2I) tasks. Notably, GenEval accuracy improves from $69\%$ to $96\%$ and PickScore increases from $20.46$ to $23.81$, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy improves from $4\%$ to $57\%$, further validating the effectiveness and generalization capability of our method.
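For intuition, the sketch below illustrates the two GRPO ingredients the abstract builds on: a critic-free, group-relative advantage computed over rewards of the final clean samples (insight (i)), and a clipped surrogate objective on their log-probabilities. This is a minimal illustration under assumed shapes and function names (`group_relative_advantages`, `grpo_loss` are hypothetical), not the paper's implementation.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within a group.

    rewards: (G,) scalar rewards, one per final clean sample x_0 generated
    for the same prompt. The group mean acts as the baseline, so no
    learned value function (critic) is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate over per-sample log-probabilities.

    logp_new / logp_old: (G,) log-probs of the final clean samples under
    the current policy and the rollout (old) policy; advantages: (G,)
    from the group-relative baseline above.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the clipped surrogate, i.e. minimize its negation.
    return -torch.min(unclipped, clipped).mean()


# Toy usage: a group of G = 4 final clean samples for one prompt.
rewards = torch.tensor([0.9, 0.2, 0.5, 0.7])
adv = group_relative_advantages(rewards)
logp_old = torch.randn(4)
logp_new = logp_old + 0.05 * torch.randn(4)  # slightly updated policy
print(grpo_loss(logp_new, logp_old, adv))
```

In this framing, the action whose log-probability enters the ratio is the final clean sample itself rather than an intermediate denoising prediction, which is the stabilization the abstract's insight (i) describes.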