Anatomy of Massive Activations and Attention Sinks
Abstract
We study two recurring phenomena in Transformer language models: \emph{massive activations}, in which a small number of hidden channels attain extremely large values at a few token positions, and \emph{attention sinks}, in which certain tokens attract a disproportionate share of attention across many heads and layers. We present a unified inference-time mechanism explaining how massive activations emerge and propagate through layers, and how normalization transforms the tokens that carry them into sparse, nearly fixed vectors that reshape the attention space and induce, or suppress, sink behavior. We further conduct ablations on models trained from scratch to disentangle the architectural and training factors governing both phenomena. We find that attention sinks persist across architectures and can arise even without massive activations. The normalization strategy primarily determines whether massive activations emerge, while head dimension and context length modulate how frequently attention sinks form.