TileSparse: Arithmetic-Intensity-Aware Sparse Attention for Compute-Bound LLM Decoding
Abstract
Sparse attention has emerged as a vital technique for long-context inference in Large Language Models (LLMs), accelerating memory-bound decoding by reducing memory accesses to non-essential keys. However, the assumption that decoding attention is memory-bound no longer holds: the proliferation of Multi-head Latent Attention (MLA) and Multi-Token Prediction (MTP) architectures has shifted decoding into a compute-bound regime. We observe that, in MLA, Q-heads exhibit sparsity even when attending to the same key; consequently, traditional sparse attention algorithms, which rigidly compute interactions between all associated Q-heads and every retrieved key, introduce significant computational inefficiency in this new regime. To address this, we propose TileSparse, an arithmetic-intensity-aware (a.i.-aware) algorithm for efficient attention in compute-bound settings. We first introduce a cost model that evaluates sparse methods by compute budget (compute tile size) rather than memory budget (number of fetched tokens). Building on this model, QK 2D Sparsity prunes unnecessary Q-head--key computations and reallocates the freed compute to retrieving more semantically important tokens. Because Q-head sparsity differs across keys, we further propose Tiered QK 2D Sparsity together with an AutoTuner that selects the best sparsity pattern. Experiments show that under tight compute budgets our method improves accuracy by 40% over state-of-the-art dynamic K-only sparse methods, and that it preserves 99% of full-attention accuracy while cutting attention compute by 40.8%, outperforming prior sparse attention approaches.
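To make the core idea concrete, the following is a minimal PyTorch-style sketch (an illustration under assumed shapes and hypothetical helper names, not the paper's actual kernel): a dynamic K-only sparse method selects a 1D subset of keys and still evaluates every Q-head against each of them, whereas QK 2D Sparsity additionally masks individual Q-head--key products, so the same compute tile budget can cover more retrieved tokens.

```python
# Illustrative sketch of the QK 2D sparsity idea (assumed shapes; not the paper's kernel).
import torch

def sparse_decode_attention(q, k_sel, v_sel, qk_2d_mask):
    """
    q:            [H, d] query heads for the current decode step
    k_sel, v_sel: [T, d] keys/values retrieved under the compute budget
    qk_2d_mask:   [H, T] bool; True where a Q-head/key product is computed.
                  A K-only sparse method corresponds to an all-True mask
                  (selection over keys only); QK 2D sparsity also prunes
                  individual per-head entries.
    """
    scores = q @ k_sel.T / (q.shape[-1] ** 0.5)            # [H, T] scaled dot products
    scores = scores.masked_fill(~qk_2d_mask, float("-inf"))  # skip pruned Q-head/key pairs
    probs = torch.softmax(scores, dim=-1)                    # [H, T]
    probs = torch.nan_to_num(probs)                          # handle heads with no active keys
    return probs @ v_sel                                     # [H, d]

# Toy usage: 16 Q-heads sharing the same retrieved keys, 64 keys under the budget.
H, T, d = 16, 64, 128
q, k, v = torch.randn(H, d), torch.randn(T, d), torch.randn(T, d)
mask = torch.rand(H, T) > 0.4                                # ~40% of Q-head/key pairs pruned
out = sparse_decode_attention(q, k, v, mask)
print(out.shape, f"active pairs: {mask.sum().item()} / {H * T}")
```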