Faster Than Flash: Exploiting Attention Sparsity for Efficient Long-Context Decoding
Zhigeng Liu ⋅ Zhiyuan Ning ⋅ Ruixiao Li ⋅ Xiaoran Liu ⋅ Yuerong Song ⋅ Min Zhang ⋅ Ziwei He ⋅ Xipeng Qiu
Abstract
The development of long-context Large Language Models (LLMs) is constrained by the memory bandwidth bottleneck and the quadratic complexity of the attention mechanism during decoding. To overcome the inherent trade-off between the memory overhead of metadata-based metrics and the computational inefficiency of adaptive selection strategies, we present \textbf{Faster Flash Decoding (FFD)}, a hardware-algorithm co-design framework that breaks the memory wall in long-context decoding. FFD integrates block selection and attention computation into a single fused kernel, replacing external metadata indices with content-aware scanning via low-bit quantization. Furthermore, we introduce the top-$\delta$ strategy, which dynamically filters blocks to achieve distribution-adaptive sparsity without global synchronization. As a training-free, plug-and-play solution, FFD reuses scanning results for computation, achieving up to 11.6x kernel-level speedup at 256k context length and a 2.37x end-to-end throughput improvement. Empirical validation on RULER and LongBench confirms that FFD maintains model accuracy while delivering high sparsity ratios.
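To make the top-$\delta$ idea concrete, the sketch below shows one plausible reading of the block-filtering criterion: retain every KV block whose approximate attention score lies within $\delta$ of the best-scoring block, so the number of retained blocks adapts to the score distribution rather than being fixed as in top-$k$. The function name, the threshold rule, and the use of per-block maximum logits are illustrative assumptions rather than the paper's definition; the actual FFD selector runs inside a fused kernel, not in host-side PyTorch.

import torch

def top_delta_select(block_scores: torch.Tensor, delta: float) -> torch.Tensor:
    # block_scores: [num_blocks] approximate maximum attention logit per KV
    # block (e.g., estimated from low-bit quantized keys, per the abstract).
    # A block is kept if its score is within `delta` of the best block,
    # i.e. exp(score - best) >= exp(-delta): a per-block comparison that
    # needs no global sort over blocks.
    best = block_scores.max()
    return block_scores >= (best - delta)

# Toy usage: 8 KV blocks, only those near the dominant block survive.
scores = torch.tensor([12.0, 3.1, 11.5, 0.2, 11.9, 4.4, 12.2, 1.0])
mask = top_delta_select(scores, delta=1.0)
print(mask)  # tensor([ True, False,  True, False,  True, False,  True, False])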