Set Diffusion: Interpolating Token Orderings between Autoregression and Diffusion for Fast and Flexible Decoding
Abstract
Masked discrete diffusion models have improved steadily, but they still lag behind autoregressive (AR) models in quality, require fixed-length generation, and cannot exploit key-value (KV) caching. Block Diffusion partially bridges diffusion and AR by unmasking token blocks left-to-right, but it sacrifices infilling flexibility and KV caching within blocks. Our key insight is that interpolating generation orderings between autoregression and fully random decoding, rather than committing to a fixed block length, yields a better bridge between diffusion and AR. We present a new class of language models, Set Diffusion, comprising 1) a tighter likelihood bound induced by an order-informed noise process and 2) a causal diffusion architecture that enables KV caching under stochastic token orderings. Rather than enforcing a strict block factorization, we bias the noise process toward left-to-right generation, so that tokens can be decoded in sliding-window sets for faster inference and greater flexibility in any-order decoding. Set Diffusion achieves better speed-quality tradeoffs on mathematical reasoning, summarization, and unconditional generation than prior diffusion language models, while offering stronger infilling performance than Block Diffusion.
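To make the central idea concrete, the following is a minimal sketch, not the paper's exact formulation, of one way to sample a decoding order that interpolates between autoregression and fully random order: each position's sort key is its index plus Gumbel noise scaled by a hypothetical temperature `tau`, so `tau = 0` recovers strict left-to-right decoding while large `tau` approaches the uniformly random order of masked diffusion, with intermediate values producing the locally shuffled, globally left-to-right orders that admit sliding-window sets.

```python
# Hedged illustration only; `sample_order` and `tau` are assumptions,
# not the paper's published algorithm.
import numpy as np

def sample_order(seq_len: int, tau: float, rng: np.random.Generator) -> np.ndarray:
    """Return a permutation of positions 0..seq_len-1.

    Sort key = position index + tau * Gumbel noise. Nearby positions can
    swap (enabling decoding in sliding-window sets) while the global
    ordering stays biased left-to-right.
    """
    gumbel = rng.gumbel(size=seq_len)
    keys = np.arange(seq_len) + tau * gumbel
    return np.argsort(keys, kind="stable")

rng = np.random.default_rng(0)
print(sample_order(10, tau=0.0, rng=rng))  # [0 1 2 ... 9] -- pure AR order
print(sample_order(10, tau=2.0, rng=rng))  # locally shuffled, globally L-to-R
```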