Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training
Seyed Morteza Emadi
Abstract
Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^2\alpha^2/r)$ rather than $\exp(-d\alpha^2)$, an improvement of $d/r$ in the exponent. For transformer attention where $r = d_h$, this yields $25$--$64\times$ tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving \emph{geometry-aware scale factors} that provide provable overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\|W^Q W^{K\top}\|_2$ via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across models from GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while matching downstream MMLU accuracy.
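To illustrate the implicit power iteration mentioned above, the following is a minimal PyTorch sketch, not the paper's implementation: the function name \texttt{spectral\_norm\_qk}, the per-head $(d, d_h)$ weight layout, and the fixed iteration count are assumptions made here for concreteness. It estimates $\|W^Q W^{K\top}\|_2$ by iterating on $M^\top M$ through four thin matrix-vector products per step, so the $d \times d$ product $M$ is never materialized.

```python
import torch

def spectral_norm_qk(w_q: torch.Tensor, w_k: torch.Tensor,
                     n_iters: int = 20) -> torch.Tensor:
    """Estimate ||W^Q W^{K,T}||_2 via power iteration without forming
    the (d x d) matrix M = W^Q W^{K,T} explicitly.

    Assumed layout (illustrative): w_q, w_k have shape (d, d_h),
    i.e. one attention head's query/key projection weights.
    """
    d = w_q.shape[0]
    v = torch.randn(d, dtype=w_q.dtype, device=w_q.device)
    v = v / v.norm()
    for _ in range(n_iters):
        # One step of power iteration on M^T M, applied implicitly:
        u = w_q @ (w_k.T @ v)          # u = M v,   four thin matvecs total
        v = w_k @ (w_q.T @ u)          # v = M^T u
        v = v / (v.norm() + 1e-12)     # renormalize to avoid under/overflow
    # With v near the top right singular vector, ||M v|| approximates sigma_max(M).
    return (w_q @ (w_k.T @ v)).norm()
```

Each iteration costs $O(d\,d_h)$ rather than the $O(d^2)$ of iterating on an explicit $M$, which is what makes a per-layer, per-step scale update cheap relative to the attention forward pass.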