Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models
Jiawei Xu ⋅ Minghui Liu ⋅ Aakriti Agrawal ⋅ Yifan Chen ⋅ Furong Huang
Abstract
Masked diffusion language models decode by iteratively unmasking tokens, where the unmasking order defines an ``order of thought'' that strongly influences generation quality yet is typically chosen heuristically. We derive a tractable upper bound on the sequential decoding mismatch, measured by the Kullback–Leibler divergence and expressed in terms of the model's pathwise log-likelihood, with tightness under sufficient model expressivity. This bound induces a dense, self-aware reward over ordered unmasking paths for a target sequence $x$ and unmasking order $\sigma$, casting order selection as a principled policy optimization problem with a frozen denoiser. We instantiate this idea as **Self-Aware Scheduling (SAS)**, which learns a lightweight order policy using Group Relative Policy Optimization and applies seamlessly to both sequential and semi-autoregressive decoding. On Sudoku with a 1B-parameter MDM, SAS improves puzzle accuracy from $82.0\%$ (best heuristic schedule) to $91.8\%$, and reaches $97.9\%$ with second-stage fine-tuning along learned trajectories. On LLaDA-8B, SAS improves pass@1 on GSM8K from $64\%$ to $76\%$ (full diffusion) and on MBPP from $39.5\%$ to $41\%$, while consistently matching or exceeding heuristic schedules across generation lengths and block sizes.
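The core idea of scoring an unmasking order by the pathwise log-likelihood under a frozen denoiser can be illustrated with a toy sketch. Everything here is illustrative and not the paper's implementation: `toy_denoiser`, the bigram target, and the exhaustive search over permutations (standing in for the learned order policy) are all hypothetical stand-ins.

```python
import itertools
import math

# Toy setup: a target whose tokens follow a left-to-right bigram rule,
# x_i = (x_{i-1} + 1) % VOCAB, so unmasking order matters.
TARGET = [0, 1, 2]
VOCAB = 3

def toy_denoiser(revealed, position):
    """Stand-in for a frozen masked denoiser: log-probs over the vocabulary
    for `position`, given already-revealed (position, token) pairs. It is
    confident only when the left neighbor is known, mimicking a model whose
    accuracy depends on the decoding order."""
    left = dict(revealed).get(position - 1)
    if position == 0:
        preferred = 0
    elif left is not None:
        preferred = (left + 1) % VOCAB
    else:
        return [math.log(1.0 / VOCAB)] * VOCAB  # no context: uniform
    probs = [0.1 / (VOCAB - 1)] * VOCAB
    probs[preferred] = 0.9
    return [math.log(p) for p in probs]

def pathwise_loglik(order):
    """Dense reward for an order sigma: sum over steps t of
    log p(x_{sigma(t)} | tokens revealed before step t)."""
    revealed, total = [], 0.0
    for pos in order:
        total += toy_denoiser(revealed, pos)[TARGET[pos]]
        revealed.append((pos, TARGET[pos]))
    return total

# Exhaustive search over orders stands in for the learned order policy.
best = max(itertools.permutations(range(len(TARGET))), key=pathwise_loglik)
print(best)  # → (0, 1, 2): the left-to-right order wins for this bigram target
```

For this contrived denoiser the left-to-right order maximizes the pathwise reward; a learned policy would instead be trained (e.g. with GRPO, as in SAS) to pick high-reward orders without enumerating permutations.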