PLASH: Provably Linear-Time Attention with Selective Higher-Order Feature Sketching
Yuwen Huang ⋅ Xiang Pan
Abstract
Attention selects information from long contexts, but standard softmax attention scales as $O(N_qN_k)$ in the number of queries $N_q$ and keys $N_k$, making long-context training and inference expensive. We propose PLASH, an attention block with provably linear-time complexity in $N_k$ that preserves the usual interface: each query still returns a data-dependent weighted combination of values. PLASH first compresses the key/value side into $M \ll N_k$ learned representatives, and then restores expressivity by enriching these representatives with selective higher-order feature sketching (e.g., TensorSketch), which approximates chosen polynomial interactions without explicit feature expansion. The final softmax readout from $\mathbf{Q}$ to the enriched $(\mathbf{K}_g,\mathbf{V}_g)$ is exact, so PLASH applies to both self- and cross-attention by treating $N_q$ and $N_k$ independently. We give a runtime analysis $O(N_k M d + N_q M d)$ (plus sketching costs), provide error bounds for the randomized sketches and an end-to-end deviation analysis relative to standard attention, and show strong long-context performance with favorable scaling versus efficient-attention baselines.
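To make the three-stage structure described above concrete, here is a minimal PyTorch sketch of a PLASH-style block, not the authors' implementation: it assumes the key/value compression is a learned-prototype pooling, illustrates the "selective higher-order feature sketching" with a degree-2 TensorSketch (CountSketch plus FFT-domain product), and uses hypothetical names such as `num_landmarks` and `sketch_dim`. The final readout is an exact softmax from $\mathbf{Q}$ to the $M$ enriched representatives, giving the $O(N_k M d + N_q M d)$ cost stated in the abstract.

```python
import math
import torch
import torch.nn as nn


def count_sketch(x, h, s, sketch_dim):
    """CountSketch of the last dim of x using hash indices h and signs s (length d each)."""
    out = x.new_zeros(*x.shape[:-1], sketch_dim)
    out.scatter_add_(-1, h.expand(*x.shape[:-1], -1), x * s)
    return out


def tensor_sketch_deg2(x, h1, s1, h2, s2, sketch_dim):
    """Degree-2 TensorSketch: circular convolution of two independent CountSketches via FFT."""
    f1 = torch.fft.rfft(count_sketch(x, h1, s1, sketch_dim), n=sketch_dim)
    f2 = torch.fft.rfft(count_sketch(x, h2, s2, sketch_dim), n=sketch_dim)
    return torch.fft.irfft(f1 * f2, n=sketch_dim)


class PLASHBlock(nn.Module):
    """Illustrative PLASH-style block: compress K/V, enrich with sketched interactions, exact readout."""

    def __init__(self, d, num_landmarks=64, sketch_dim=128):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_landmarks, d) / math.sqrt(d))
        self.enrich = nn.Linear(d + sketch_dim, d)  # fold sketched features back to d dims
        self.scale = 1.0 / math.sqrt(d)
        self.sketch_dim = sketch_dim
        # Fixed random hash and sign functions for the two CountSketches.
        self.register_buffer("h1", torch.randint(0, sketch_dim, (d,)))
        self.register_buffer("h2", torch.randint(0, sketch_dim, (d,)))
        self.register_buffer("s1", torch.randint(0, 2, (d,)).float() * 2 - 1)
        self.register_buffer("s2", torch.randint(0, 2, (d,)).float() * 2 - 1)

    def forward(self, q, k, v):
        # q: (B, Nq, d); k, v: (B, Nk, d)
        # 1) Compress keys/values into M learned representatives: O(Nk * M * d).
        pool = torch.softmax(self.prototypes @ k.transpose(-1, -2) * self.scale, dim=-1)
        k_rep, v_rep = pool @ k, pool @ v  # (B, M, d)
        # 2) Enrich representatives with sketched degree-2 interactions.
        hi = tensor_sketch_deg2(k_rep, self.h1, self.s1, self.h2, self.s2, self.sketch_dim)
        k_g = self.enrich(torch.cat([k_rep, hi], dim=-1))
        # 3) Exact softmax readout from Q to the enriched representatives: O(Nq * M * d).
        attn = torch.softmax(q @ k_g.transpose(-1, -2) * self.scale, dim=-1)
        return attn @ v_rep
```

Because the softmax in step 3 ranges over only $M$ representatives rather than $N_k$ keys, the block's cost is linear in both $N_q$ and $N_k$; the sketching step adds $O(M(d + D \log D))$ for sketch width $D$, matching the "plus sketching costs" caveat above.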