Induction Heads Interpolate N-Grams
Francesco D'Angelo ⋅ Oğuz Yüksel ⋅ Swathi Narashiman ⋅ Nicolas Flammarion
Abstract
Induction heads are attention circuits believed to underlie in-context learning in transformers, yet a precise characterization of the estimators they implement remains elusive. We study transformers trained on order-$k$ Markov chains and prove that a two-layer disentangled transformer implements a soft context-matching estimator that aggregates contributions from all partial context matches, weighted exponentially by their degree of overlap. This mechanism admits two complementary smoothing interpretations. First, prepending a beginning-of-sequence (BOS) token induces additive pseudo-counts, recovering Dirichlet-style smoothing. Second, a finite attention temperature enables interpolation across context orders, analogous to Jelinek–Mercer smoothing but with data-dependent weights that adapt to each sequence's local structure. Experiments on trained transformers confirm that learned attention patterns match our theoretical construction and approach Bayes-optimal performance in regimes where hard counting fails. Our results bridge mechanistic interpretability of induction heads with classical statistical smoothing, revealing that transformers learn to regularize in-context estimation rather than simply count.
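The abstract does not spell out the estimator's exact form, so the following Python sketch only illustrates the kind of soft context-matching it describes: every past continuation is weighted by the exponential of its context overlap with the current order-$k$ context. The function name `soft_ngram_estimator` and the inverse-temperature parameter `beta` are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def soft_ngram_estimator(seq, vocab_size, k, beta):
    """Soft context-matching next-token estimator (illustrative sketch).

    For each past position t, compare the length-<=k suffix ending at t with
    the current context (ending at the last token) and weight the observed
    continuation seq[t+1] by exp(beta * overlap). As beta -> infinity this
    reduces to hard order-k counting; finite beta blends lower-order statistics.
    """
    T = len(seq)
    weights = np.zeros(vocab_size)
    for t in range(T - 1):  # positions that have a known continuation seq[t+1]
        overlap = 0
        while (overlap < k and overlap <= t
               and seq[t - overlap] == seq[T - 1 - overlap]):
            overlap += 1
        weights[seq[t + 1]] += np.exp(beta * overlap)
    if weights.sum() == 0:
        return np.full(vocab_size, 1.0 / vocab_size)  # uniform fallback
    return weights / weights.sum()

# Example: binary sequence, order-2 contexts; a moderate beta interpolates
# between order-0, order-1, and order-2 in-context statistics.
rng = np.random.default_rng(0)
seq = rng.integers(0, 2, size=200).tolist()
print(soft_ngram_estimator(seq, vocab_size=2, k=2, beta=4.0))
```

In this toy version, a BOS-style pseudo-count could be mimicked by initializing `weights` with a small positive constant instead of zeros, mirroring the Dirichlet-style smoothing interpretation mentioned above.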