Set Diffusion: Interpolating Token Orderings between Autoregression and Diffusion for Fast and Flexible Decoding
Abstract
Masked discrete diffusion models have improved steadily, but they still lag behind autoregressive (AR) models in quality, require fixed-length generation, and cannot exploit key-value (KV) caching. Block Diffusion partially bridges diffusion and AR by unmasking token blocks left-to-right, but it sacrifices infilling flexibility and KV caching within blocks. Our key insight is that interpolating generation orderings between autoregression and fully random decoding, rather than committing to a fixed block length, yields a better bridge between diffusion and AR. We present a new class of language models, Set Diffusion, comprising 1) a tighter likelihood bound induced by an order-informed noise process and 2) a causal diffusion architecture that enables KV caching under stochastic token orderings. Rather than enforcing a strict block factorization, we bias the noise process toward left-to-right generation, so that tokens can be decoded in sliding-window sets for faster inference and greater flexibility in any-order decoding. Set Diffusion achieves better speed-quality tradeoffs on mathematical reasoning, summarization, and unconditional generation than prior diffusion language models, while offering stronger infilling performance than Block Diffusion.
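To make the central idea concrete, the following is a minimal sketch, not the paper's exact formulation, of one way to sample a decoding order that interpolates between autoregression and fully random order: each position's sort key is its index plus Gumbel noise scaled by a hypothetical temperature `tau`, so `tau = 0` recovers strict left-to-right decoding while large `tau` approaches the uniformly random order of masked diffusion, with intermediate values producing the locally shuffled, globally left-to-right orders that admit sliding-window sets.

```python
# Hedged illustration only; `sample_order` and `tau` are assumptions,
# not the paper's published algorithm.
import numpy as np

def sample_order(seq_len: int, tau: float, rng: np.random.Generator) -> np.ndarray:
    """Return a permutation of positions 0..seq_len-1.

    Sort key = position index + tau * Gumbel noise. Nearby positions can
    swap (enabling decoding in sliding-window sets) while the global
    ordering stays biased left-to-right.
    """
    gumbel = rng.gumbel(size=seq_len)
    keys = np.arange(seq_len) + tau * gumbel
    return np.argsort(keys, kind="stable")

rng = np.random.default_rng(0)
print(sample_order(10, tau=0.0, rng=rng))  # [0 1 2 ... 9] -- pure AR order
print(sample_order(10, tau=2.0, rng=rng))  # locally shuffled, globally L-to-R
```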