When Does Polynomial Attention Concentrate? A Relative-Margin Diagnostic for Zero-Shot Softmax Substitution
Sanny Kim
Abstract
We study zero-shot pointwise substitution of softmax attention in pretrained language models, where the score row is held fixed and only the row map is replaced. For normalized polynomial attention with active-branch degree $p$, the target receives mass $1/(1+S_p)$, where $S_p = \sum_{j \neq *} (r_j/r_*)^p$, and the criterion $S_p < 1$ is exactly the recall criterion under aligned-binary value vectors. Softmax is uniquely characterized among continuous positive pointwise normalized maps by additive row-shift invariance. Polynomial attention instead concentrates according to the relative margin, through the factor $(1-\Delta/r_*)^p$, which yields a worst-case sufficient degree $p^\star = \log A / \log(1/(1-\rho))$ for $A$ active distractors at relative margin $\rho$; softmax, by comparison, has the absolute-margin envelope $1/(1 + A e^{-\beta \Delta})$. We audit four small open-weight decoder LMs at contexts up to 4096 tokens. Three of the four models exceed 70% unsafe active-unsaturated rows on every diagnostic suite at $p = 2$, $b = 0$, and a 14-recipe substitution sweep finds no universal drop-in replacement. However, on an uncontaminated needle-in-a-haystack probe, two of the four models admit sample-perfect polynomial recipes. We position $S_p$ as a no-go signal: in our tested cells, a very low measured-safe fraction was a sharp negative screen, but a high measured-safe fraction did not guarantee preservation, so behavioral substitution tests remain necessary even when diagnostic statistics appear favorable.
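The quantities named in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: the function names are hypothetical, the scores are assumed to be positive and already on the active branch (with $b = 0$), and the relative margin is assumed to be $\rho = \Delta/r_*$, which is how the factor $(1-\Delta/r_*)^p$ and the formula for $p^\star$ appear to fit together.

```python
import math


def s_p(target_score, distractor_scores, p):
    """S_p = sum over distractors j of (r_j / r_*)^p, assuming positive scores."""
    return sum((r / target_score) ** p for r in distractor_scores)


def target_mass(target_score, distractor_scores, p):
    """Mass on the target under normalized polynomial attention: 1 / (1 + S_p)."""
    return 1.0 / (1.0 + s_p(target_score, distractor_scores, p))


def worst_case_degree(num_active, rho):
    """p* = log A / log(1 / (1 - rho)); S_p < 1 in the worst case needs p > p*."""
    return math.log(num_active) / math.log(1.0 / (1.0 - rho))


def softmax_envelope(num_active, beta, delta):
    """Softmax target-mass envelope 1 / (1 + A * exp(-beta * delta))."""
    return 1.0 / (1.0 + num_active * math.exp(-beta * delta))


# Illustrative values (assumed, not taken from the paper):
# one target with score 1.0 and A = 8 distractors at relative margin rho = 0.5.
r_star = 1.0
distractors = [0.5] * 8

for degree in (2, 4):
    s = s_p(r_star, distractors, degree)
    mass = target_mass(r_star, distractors, degree)
    print(f"p={degree}: S_p={s:.3f}, target mass={mass:.3f}, safe={s < 1}")

print("worst-case degree p*:", worst_case_degree(8, 0.5))
print("softmax envelope (beta=1, delta=0.5):", softmax_envelope(8, 1.0, 0.5))
```

With these assumed values, $p^\star = \log 8 / \log 2 = 3$. At $p = 2$ the row is unsafe ($S_2 = 2$, target mass $1/3$), while at $p = 4$ it is safe ($S_4 = 0.5$, target mass $2/3$). Scaling every score by a common positive factor leaves $S_p$ unchanged, which is the sense in which the criterion depends on relative rather than absolute margin.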