Attention's forward pass and Frank-Wolfe
Albert Alcalde ⋅ Borjan Geshkovski ⋅ Domènec Ruiz-Balet
Abstract
We analyze the hardmax limit of self-attention dynamics for token embeddings in the zero-temperature regime ($\beta \to +\infty$) and relate it to finite-$\beta$ behavior. In this limit, the update rule can be viewed as a Frank-Wolfe step for a quadratic objective over the convex hull of the current tokens. When the key-query matrix is negative semidefinite, the dynamics converge at the standard sublinear rate $\mathcal{O}(t^{-1})$ in the quadratic energy. In the positive semidefinite case, extending the hardmax rule to the convex hull induces a Voronoi structure: vertices are stationary, interior points remain in their initial cells, and each token moves along a straight line toward its cell's vertex, converging exponentially when the step size is bounded away from zero. We additionally establish well-posedness of the associated ODE limit in this regime. For finite $\beta$, we model self-attention as a Markov chain and prove *dynamic metastability*: interior tokens reach near-vertex configurations in a constant number of steps and remain trapped for times exponential in $\beta$ with high probability, before eventually collapsing to a point within the initial convex hull. Thus, hardmax dynamics accurately approximate the finite-$\beta$ process over exponentially long time horizons.
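To make the Frank-Wolfe reading concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a particular form of the update that the abstract does not spell out: each token $x_i$ selects the token $x_{j^*}$ maximizing $\langle B x_i, x_j \rangle$ and moves a fraction $\alpha$ of the way toward it, with all tokens updated synchronously. Under that assumption, and for symmetric $B$, selecting $x_{j^*}$ coincides with the linear minimization oracle of Frank-Wolfe for $f(x) = -\tfrac{1}{2} x^\top B x$ over the convex hull of the current tokens, since a linear function on a polytope attains its maximum at a vertex. The names `hardmax_attention_step`, `X`, `B`, and `alpha` are illustrative.

```python
import numpy as np


def hardmax_attention_step(X, B, alpha):
    """One synchronous hardmax update (assumed form, for illustration only).

    X     : (n, d) array, one token per row.
    B     : (d, d) key-query matrix (assumed symmetric here).
    alpha : step size in (0, 1].
    """
    # scores[i, j] = <B x_i, x_j>
    scores = X @ B.T @ X.T
    # Hardmax: each token attends only to its top-scoring token.
    # np.argmax breaks ties by taking the first maximizer.
    targets = X[np.argmax(scores, axis=1)]
    # Frank-Wolfe-style convex combination toward the selected token.
    return X + alpha * (targets - X)


# Toy example with B = I (positive semidefinite): two vertices and one
# interior token of the triangle they span together with the third point.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.2, 0.1]])
Y = hardmax_attention_step(X, np.eye(2), alpha=0.5)
# The first two tokens select themselves and stay put; the third moves
# halfway toward (1, 0), giving (0.6, 0.05).
```

In this toy configuration the behavior matches the positive semidefinite picture described above: tokens that maximize their own score are stationary, and the remaining token moves along a straight line toward the token it selects.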