Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining
Abstract
We present Multipole Semantic Attention (MuSe), an efficient approximation of softmax attention for long-context transformers. MuSe clusters queries and keys separately in their learned representation spaces, computing query-specific cluster summaries that capture how each query cluster attends to each key cluster. These summaries are combined with exact attention over retrieved high-attention key clusters. Unlike prior work that clusters only keys, our separate query clustering provides a ~9× advantage in effective cluster count, enabling high approximation quality at extreme sparsity. For causal attention, we introduce a block-sparse structure with causal accumulation of cluster summaries across spatial blocks, followed by two-level retrieval. At 64k context, MuSe achieves 64× sparsity in the far-field attention with <1% relative squared error and a 2× speedup over cuDNN Flash Attention on isolated attention layers. We pretrain language models with up to 1B parameters at 64k context, achieving a 36% wall-clock speedup with <1% loss degradation.
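
To make the mechanism concrete, the following is a minimal, non-causal PyTorch sketch of the clustered-attention idea summarized above: key-cluster summaries stand in for the far field while the highest-attention key clusters are retrieved and computed exactly. The function name muse_sketch, the top_r retrieval parameter, and the centroid-plus-log-size summary construction are illustrative assumptions; the paper's query-specific summaries, causal accumulation, and block-sparse kernels are not reproduced here.

import torch

def muse_sketch(Q, K, V, k_labels, n_clusters, top_r=2):
    # Q, K, V: (N, d) float tensors; k_labels: (N,) long tensor of key-cluster ids in [0, n_clusters).
    N, d = Q.shape
    scale = d ** -0.5

    # Key-cluster summaries: per-cluster centroid key and mean value, plus cluster sizes.
    counts = torch.bincount(k_labels, minlength=n_clusters).clamp(min=1).float()
    K_c = torch.zeros(n_clusters, d).index_add_(0, k_labels, K) / counts[:, None]
    V_c = torch.zeros(n_clusters, d).index_add_(0, k_labels, V) / counts[:, None]

    # Far field: every query scores each key-cluster centroid; the log-size bias makes
    # a cluster of m keys contribute roughly like m copies of its centroid.
    cluster_scores = Q @ K_c.T * scale + torch.log(counts)[None, :]      # (N, C)

    # Retrieval: the top_r highest-scoring key clusters per query are computed exactly.
    top_idx = cluster_scores.topk(top_r, dim=-1).indices                 # (N, top_r)
    near = torch.zeros(N, n_clusters, dtype=torch.bool)
    near[torch.arange(N)[:, None], top_idx] = True                       # (N, C)
    near_mask = near[:, k_labels]                                        # (N, N): is key j near for query i?

    # One softmax over exact near-field scores and far-field cluster scores keeps the
    # normalization consistent; retrieved clusters are masked out of the far field.
    exact_scores = (Q @ K.T * scale).masked_fill(~near_mask, float("-inf"))
    far_scores = cluster_scores.masked_fill(near, float("-inf"))
    weights = torch.softmax(torch.cat([exact_scores, far_scores], dim=-1), dim=-1)
    return weights[:, :N] @ V + weights[:, N:] @ V_c

# Example usage with random data and random cluster labels (stand-ins for k-means output):
# out = muse_sketch(torch.randn(1024, 64), torch.randn(1024, 64), torch.randn(1024, 64),
#                   torch.randint(0, 16, (1024,)), n_clusters=16, top_r=2)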