LoSA: Locality Aware Sparse Attention in Diffusion Language Models
Abstract
Block-wise diffusion language models (DLMs) generate multiple tokens in parallel, offering a promising alternative to autoregressive decoding. However, their inference efficiency remains bottlenecked by memory-bound attention in long-context scenarios. Naïve sparse attention is ineffective for DLMs because of the KV inflation problem: different queries select different prefix positions, so the union of accessed KV pages remains large. To address this challenge, we observe that block-wise diffusion exhibits locality of representation changes across denoising steps: only a small fraction of tokens (active tokens) undergo significant hidden-state updates, while most tokens (stable tokens) remain nearly unchanged. Based on this insight, we propose LoSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens with large representation changes. This design reduces the number of queries contributing to the union of KV indices, substantially shrinking the set of KV pages that must be loaded. Across multiple block-wise DLMs and reasoning benchmarks, LoSA preserves near-dense accuracy while significantly improving efficiency, achieving up to a 4.14× speedup over dense attention on RTX A6000 GPUs. LoSA also delivers up to a 5% average improvement over sparse-attention baselines across all datasets and configurations.
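The following is a minimal sketch of the locality-aware idea summarized above, not the paper's implementation. The change threshold, page size, top-k page count, and helper names are illustrative assumptions: stable tokens reuse cached prefix-attention outputs, while only active tokens select pages, so the union of loaded KV pages stays small.

```python
# Illustrative sketch of locality-aware sparse attention (assumed details,
# not the authors' code). Threshold, page size, and top-k are hypothetical.
import torch

def losa_step(q, k_prefix, v_prefix, prev_hidden, hidden, cached_out,
              change_threshold=0.05, pages=16, topk_pages=4):
    """One denoising step of prefix attention for a block of T tokens.

    q:                  [B, T, D] queries for the current block
    k_prefix, v_prefix: [B, L, D] cached prefix keys/values (L divisible by `pages`)
    prev_hidden, hidden:[B, T, D] hidden states at the previous / current step
    cached_out:         [B, T, D] prefix-attention outputs cached at the previous step
    """
    B, T, D = q.shape
    L = k_prefix.shape[1]
    page_len = L // pages

    # 1) Locality test: tokens whose hidden state barely changed are "stable".
    delta = (hidden - prev_hidden).norm(dim=-1) / (prev_hidden.norm(dim=-1) + 1e-6)
    active = delta > change_threshold                      # [B, T] boolean mask

    out = cached_out.clone()                               # stable tokens: reuse cache
    for b in range(B):
        idx = active[b].nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        qa = q[b, idx]                                     # [A, D] active queries only

        # 2) Page selection per active query (mean-pooled page keys as a proxy score).
        page_keys = k_prefix[b].view(pages, page_len, D).mean(dim=1)   # [P, D]
        sel = (qa @ page_keys.T).topk(topk_pages, dim=-1).indices      # [A, topk]
        union = torch.unique(sel)                          # few active queries -> small union

        # 3) Sparse attention over only the union of selected pages.
        cols = (union[:, None] * page_len +
                torch.arange(page_len, device=union.device)).reshape(-1)
        ks, vs = k_prefix[b, cols], v_prefix[b, cols]
        attn = torch.softmax(qa @ ks.T / D ** 0.5, dim=-1)
        out[b, idx] = attn @ vs
    return out, active
```

Because stable tokens never contribute page selections in this sketch, the number of gathered KV pages scales with the (small) count of active tokens rather than with the full block, which is the source of the memory-traffic savings described in the abstract.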