Towards Understanding Massive Activations in the Attention Sink Mechanism
Abstract
Recent studies have revealed two intriguing phenomena in large language models: massive activations, in which a small number of activations exhibit abnormally large magnitudes, and attention sink, in which a disproportionate amount of attention is consistently allocated to specific tokens regardless of their semantic relevance. However, the co-emergence and co-existence of these two phenomena remain poorly understood. In this work, we revisit the prevailing view that massive activations are the primary mechanism responsible for concentrating attention on sink tokens and provide a more nuanced interpretation of their relationship. Through both theoretical analysis and empirical evidence, we demonstrate that massive activations and attention sink jointly act to prevent excessive token mixing in self-attention. Specifically, attention sink suppresses mixing among non-sink tokens, whereas massive activations suppress mixing between sink tokens and non-sink tokens. Furthermore, our theory provides a principled explanation of how the location of massive activations depends on the placement of layer normalization, and of why KV-biases and gating mechanisms can remove massive activations while largely preserving attention sink. We further conduct intervention analyses and find that removing the value vector of the sink token can recover attention sink even when massive activations are entirely suppressed. Overall, this work provides a mechanistic perspective on how massive activations and attention sink interact under normalization and self-attention, offering new insights into their functional roles in Transformer models.