Generalization Bounds for Discrete Diffusion: Statistical Advantage of Masking
Abstract
Discrete diffusion models have recently emerged as a compelling alternative for language generation, enabling efficient non-autoregressive sampling while achieving strong empirical performance. A key design choice in discrete diffusion---absent in most continuous diffusion formulations---is the forward corruption kernel, with masked/absorbing corruption now dominating practice. Despite this empirical preference, there is limited statistical theory explaining when and why masking should outperform alternative kernels such as uniform replacement. In this paper, we take a step toward closing this gap from a statistical learning perspective. Our analysis establishes generalization bounds and, through an explicit comparison across forward corruption kernels, reveals a central advantage of masking: the resulting bound scales with the size of the effective data support rather than with the full ambient state space, thereby mitigating the curse of state-space cardinality. We further derive structure-aware refinements that capture how concentration and sparsity in real sequential data sharpen the resulting sample complexity bounds. Together, these results offer a principled explanation for the empirical strength of masked diffusion and provide guidance for forward-kernel design in discrete generative modeling.