A Capacity-Based Rationale for Multi-Head Attention
Micah Adler
Abstract
We study the capacity of the self-attention key-query channel: for a fixed budget, how many distinct token-token relations can a single layer reliably encode? We introduce *Relational Graph Recognition*, in which the key-query channel encodes a directed graph and, given a context (a subset of the vertices), must recover the neighbors of each vertex in the context. We measure resources by the total key dimension $D_K = h\,d_k$. In a tractable multi-head model, we prove an information-theoretic lower bound and a matching upper bound via an explicit construction, showing that recovering a graph with $m'$ relations in $d_{\text{model}}$-dimensional embeddings requires $D_K$ to grow essentially as $m'/d_{\text{model}}$, up to logarithmic factors; we obtain corresponding guarantees for scaled-softmax attention. This analysis yields a new, capacity-based rationale for multi-head attention: even in permutation graphs, where **each query attends to a single target**, splitting a fixed $D_K$ budget into multiple heads increases capacity by reducing interference from embedding superposition. Controlled experiments mirror the theory, revealing sharp phase transitions at the predicted capacity, and the multi-head advantage persists under softmax normalization and value routing, and in a full Transformer block trained with frozen GPT-2 embeddings.
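As a rough illustration of the budget split discussed in the abstract, the sketch below (not code from the paper; the function name `toy_attention_scores` and all dimensions are illustrative assumptions) builds scaled-softmax attention scores for one wide head of key dimension $D_K$ and for $h$ narrower heads of dimension $d_k = D_K/h$, so both configurations spend the same total key dimension.

```python
# Minimal sketch, assuming random embeddings and projections; it only shows how a
# fixed total key budget D_K = h * d_k can be spent on one wide head or h narrow heads.
import numpy as np

def toy_attention_scores(X, W_Q, W_K):
    """Scaled dot-product attention scores for one head (softmax over key positions)."""
    d_k = W_K.shape[1]
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, D_K, h = 8, 64, 32, 4           # fixed total key budget D_K = h * d_k
d_k = D_K // h
X = rng.standard_normal((n_tokens, d_model))       # toy token embeddings

# One wide head using the full budget (h = 1, key dimension D_K) ...
wide = toy_attention_scores(X,
                            rng.standard_normal((d_model, D_K)),
                            rng.standard_normal((d_model, D_K)))

# ... versus h narrower heads sharing the same budget (key dimension d_k each).
narrow = [toy_attention_scores(X,
                               rng.standard_normal((d_model, d_k)),
                               rng.standard_normal((d_model, d_k)))
          for _ in range(h)]

print(wide.shape, len(narrow), narrow[0].shape)    # (8, 8) 4 (8, 8)
```

Both configurations produce one score matrix per head over the same tokens; the paper's claim concerns how many relations such a channel can encode reliably under each split, not the mechanics shown here.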