Attention Implements the Fisher Geometry of Exponential Families
Abstract
Softmax attention is increasingly treated as a reusable inference primitive in transformers. Much prior theory, however, covers Gaussian or linear models or assumes a single shared quadratic query--key metric, an assumption that can fail for non-Gaussian exponential-family likelihoods with varying curvature; this risks overreading learned similarities as a global metric or as evidence of Bayes-optimality. For discrete latent symbols with exponential-family observations, we show that a single head can implement Bayes posteriors (and posterior means) by setting its logits to log prior plus log likelihood, and we characterize single-head posteriors as exactly the log-linear (exponential-family) class. Using convex duality, we rewrite log-likelihoods as negative Bregman divergences on mean/sufficient-statistic space, turning Bayes' rule into a soft nearest-neighbor computation. This yields a sharp boundary for when a globally shared quadratic metric suffices, together with a multi-head curvature-atlas approximation whose guarantees scale with the number of heads, and we extend these guarantees to in-context estimation with consistency and finite-sample stability bounds. In synthetic Gaussian and Bernoulli in-context estimation tasks, trained minimal attention models validate these predictions: performance approaches a Bayes-oracle baseline as prompt length grows, and learned metrics align with noise precision in the Gaussian case, while the Bernoulli case retains a gap consistent with curvature variation. Together, these results explain when Fisher geometry should emerge, when a single metric is justified, and when multiple heads are necessary for Bayes-like in-context estimators beyond the Gaussian setting.
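
As a minimal worked identity behind the core claim (a sketch in notation assumed here, not taken from the paper's own definitions): for K discrete symbols with prior \pi_k and likelihoods p(x \mid \theta_k), setting the attention logit for key k to z_k = \log \pi_k + \log p(x \mid \theta_k) gives

  \operatorname{softmax}(z)_k
  = \frac{\exp\!\big(\log \pi_k + \log p(x \mid \theta_k)\big)}{\sum_{j=1}^{K} \exp\!\big(\log \pi_j + \log p(x \mid \theta_j)\big)}
  = \frac{\pi_k\, p(x \mid \theta_k)}{\sum_j \pi_j\, p(x \mid \theta_j)}
  = p(k \mid x),

so the attention weights are exactly the Bayes posterior, and the attention output \sum_k p(k \mid x)\, v_k is a posterior mean over values. Under the exponential-family form p(x \mid \theta) = h(x)\exp\!\big(\theta^\top T(x) - A(\theta)\big), convex duality further gives \log p(x \mid \theta_k) = -B_{A^*}\!\big(T(x), \mu_k\big) + \mathrm{const}(x) with \mu_k = \nabla A(\theta_k) and B_{A^*} the Bregman divergence of the conjugate A^*, which is the soft nearest-neighbor reading of Bayes' rule referred to above.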