SPEED: Sharpened-Teacher Distillation for Parallel Decoding of Diffusion Language Models
Abstract
Diffusion-based large language models generate text by gradually filling in masked tokens, yet they remain slow because they usually decode only a few tokens per step. Parallel decoding, which unmasks multiple tokens simultaneously, promises acceleration but often degrades quality when too many tokens are predicted at once. We identify the root cause: when decoding is viewed as iterative token grouping, overly permissive grouping places interdependent tokens in the same step, violating the conditional independence assumption and amplifying reliance on noisy context even when the top prediction is already correct. We introduce SPEED, a framework that enlarges safe parallel groups through complementary training and inference designs. At training time, a sharpened-teacher distillation objective selectively aligns the student to teacher-correct positions using a temperature-scaled KL term together with a masked language modeling loss, producing a student that assigns more probability mass to correct token identities and lifts more positions above the decoding threshold. At inference time, Slow–Fast Decoding partitions tokens by their sensitivity to revealed context, measured by token-wise Jensen–Shannon divergence computed with and without access to the preceding block: low-sensitivity tokens are decoded jointly, while high-sensitivity tokens are deferred until sufficient context resolves them. In extensive experiments, SPEED attains up to a 12.2× speedup on LLaDA-8B-Instruct and 6.7× on Dream-7B-Instruct with accuracy close to greedy decoding across standard reasoning and code benchmarks.
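To make the training objective concrete, one plausible form of the sharpened-teacher distillation loss described above is sketched below. The notation (teacher logits $z^{T}_{i}$, student logits $z^{S}_{i}$, temperature $\tau$, weight $\lambda$, and the set $\mathcal{C}$ of teacher-correct masked positions) is ours, not taken from the paper, so this should be read as an illustrative sketch rather than the exact objective.

$$
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{MLM}}(\theta) \;+\; \lambda \sum_{i \in \mathcal{C}} \mathrm{KL}\!\left(\operatorname{softmax}\!\left(\frac{z^{T}_{i}}{\tau}\right) \,\Big\|\, \operatorname{softmax}\!\left(\frac{z^{S}_{i}(\theta)}{\tau}\right)\right)
$$

Here $\mathcal{C}$ contains only the masked positions at which the teacher's top prediction matches the reference token, so the KL term pulls the student toward the teacher only where the teacher is already correct, while the temperature $\tau$ controls how sharply the teacher distribution concentrates on its top prediction.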
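The Slow–Fast partition at inference time can likewise be illustrated with a small sketch. The function names, the `threshold` parameter, and the tensor shapes are illustrative assumptions; only the idea of comparing per-token distributions predicted with and without the preceding block via Jensen–Shannon divergence comes from the abstract.

```python
import torch

def token_jsd(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Token-wise Jensen-Shannon divergence between two distributions.

    p, q: (num_masked_tokens, vocab_size) probability tensors, e.g. softmaxed
    logits predicted with and without access to the preceding block.
    Returns a (num_masked_tokens,) tensor of divergences.
    """
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps).log() - (m + eps).log())).sum(dim=-1)
    kl_qm = (q * ((q + eps).log() - (m + eps).log())).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)

def slow_fast_partition(probs_with_block: torch.Tensor,
                        probs_without_block: torch.Tensor,
                        threshold: float = 0.1):
    """Split masked positions into a 'fast' group (low sensitivity to the revealed
    block: safe to decode jointly this step) and a 'slow' group (high sensitivity:
    keep masked and revisit once more context has been revealed)."""
    jsd = token_jsd(probs_with_block, probs_without_block)
    fast = jsd <= threshold
    slow = ~fast
    return fast, slow

# Toy usage with random distributions standing in for real model outputs.
if __name__ == "__main__":
    num_tokens, vocab = 8, 32
    with_block = torch.softmax(torch.randn(num_tokens, vocab), dim=-1)
    without_block = torch.softmax(torch.randn(num_tokens, vocab), dim=-1)
    fast, slow = slow_fast_partition(with_block, without_block, threshold=0.2)
    print("decode now:", fast.nonzero(as_tuple=True)[0].tolist())
    print("defer:     ", slow.nonzero(as_tuple=True)[0].tolist())
```

The threshold trades speed for safety: a larger value decodes more positions per step, while a smaller one defers more tokens until additional context has been revealed.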