Is Your Diffusion Sampler Actually Correct? A Sampler-Centric Evaluation of Discrete Diffusion Language Models
Abstract
Discrete diffusion language models (dLLMs) offer a fast and flexible alternative to autoregressive models (ARMs) for discrete sequence generation by performing iterative denoising with parallel updates. Despite these advantages, dLLMs are commonly evaluated using metrics developed for ARMs. Such evaluations rely on metrics computed from final generated samples and therefore conflate model approximation error, arising from the learned denoiser, with sampler-induced error, arising from the sampling dynamics. We introduce a sampler-centric oracle evaluation framework that replaces the learned denoiser with an exact Hidden Markov Model posterior derived from a ground-truth Markov chain, isolating sampler-induced error under controlled and method-consistent settings. We show that few-step discrete diffusion samplers are not distributionally correct even under an exact oracle denoiser: substantial distributional mismatch at the transition level persists at low step counts and vanishes only as the number of diffusion steps approaches the sequence length. We also find that current metrics for evaluating dLLMs are insufficient: improvements in negative log-likelihood, generative perplexity, or MAUVE do not imply correct sampling.
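The sampler-induced error described above can be illustrated with a minimal sketch. This is our own toy construction, not the paper's experimental protocol: a two-state ground-truth Markov chain over a length-2 sequence, starting fully masked. The oracle posterior provides exact per-position marginals, yet a one-step parallel sampler that unmasks both positions independently produces the product of marginals rather than the true joint, so transition-level error remains nonzero even with a perfect denoiser. Only when the number of unmasking steps reaches the sequence length (here, two sequential steps) does the sampled distribution match the joint exactly.

```python
# Toy ground-truth Markov chain (assumed for illustration, not from the paper):
# two states, transition matrix P[i][j] = p(x2 = j | x1 = i).
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [0.5, 0.5]  # initial distribution over x1

# True joint p(x1, x2) under the ground-truth chain.
joint = {(a, b): pi[a] * P[a][b] for a in (0, 1) for b in (0, 1)}

# Oracle posterior marginals at the all-masked state (both computable exactly).
m1 = [pi[a] for a in (0, 1)]                                  # p(x1)
m2 = [sum(pi[a] * P[a][b] for a in (0, 1)) for b in (0, 1)]   # p(x2)

# One-step parallel sampler: unmask both positions independently from their
# marginals. The resulting distribution is the product of marginals.
one_step = {(a, b): m1[a] * m2[b] for a in (0, 1) for b in (0, 1)}

# Two-step sequential sampler: draw x1 ~ p(x1), then x2 ~ p(x2 | x1).
# With steps == sequence length, this reproduces the joint exactly.
two_step = {(a, b): m1[a] * P[a][b] for a in (0, 1) for b in (0, 1)}

def tv(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

print(f"TV(one-step, joint) = {tv(one_step, joint):.4f}")  # 0.3500: nonzero error
print(f"TV(two-step, joint) = {tv(two_step, joint):.4f}")  # 0.0000: exact
```

The denoiser here is exact by construction, so the 0.35 total variation gap is attributable entirely to the sampler; this is the decomposition the oracle framework makes measurable.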