Modelling Attention with Aitchison Geometry: Token Distinguishability and Temperature Scaling
Sam Hilton-Jones ⋅ Timothy Norman ⋅ Zhanxing Zhu
Abstract
The attention mechanism with softmax normalisation is a foundational component of Transformer-based large language models. However, with very long contexts, attention scores are known to diminish, raising fundamental questions about token distinguishability and how it can be preserved. In this work, we provide a formal characterisation of token distinguishability in attention as a function of context length and embedding dimension. We introduce Aitchison distance to quantify relative differences among attention probabilities, and show that, with Gaussian queries and keys, even in the long-context regime, token distinguishability converges to a finite, non-zero limit rather than vanishing. Leveraging the linear relationship between temperature scaling and Aitchison distance, we derive a theoretical lower bound of $\Omega(\sqrt{\log L})$ on the logit scaling required to produce a sharp attention distribution. Finally, we demonstrate that Aitchison distance provides a principled and practical alternative to entropy for monitoring training and inference, as it captures the full compositional structure, including the smaller components of the attention probabilities.
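The Aitchison distance mentioned in the abstract is the Euclidean distance between the centred log-ratio (clr) transforms of two probability vectors. A minimal NumPy sketch (function names are illustrative, not taken from the paper) makes the stated linear relationship concrete: since clr(softmax(z/T)) = (z − z̄)/T, the Aitchison distance from a softmax attention distribution to the uniform distribution scales linearly with the inverse temperature 1/T.

```python
import numpy as np

def clr(p):
    # Centred log-ratio transform: log(p) minus its mean,
    # i.e. log-ratios relative to the geometric mean of p.
    logp = np.log(p)
    return logp - logp.mean()

def aitchison_distance(p, q):
    # Aitchison distance between two compositions:
    # Euclidean distance in clr coordinates.
    return np.linalg.norm(clr(p) - clr(q))

def softmax(z, temperature=1.0):
    # Temperature-scaled softmax with a max-shift for stability.
    z = z / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Halving the temperature doubles the Aitchison distance
# between the attention distribution and the uniform one.
logits = np.array([1.0, 2.0, 3.0])
uniform = np.ones_like(logits) / logits.size
d1 = aitchison_distance(softmax(logits, temperature=1.0), uniform)
d2 = aitchison_distance(softmax(logits, temperature=0.5), uniform)
```

Because clr(softmax(z)) = z − z̄ exactly, this linearity holds for any logit vector, which is what makes the Aitchison distance a convenient handle for reasoning about logit scaling.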