Poster in Workshop: 3rd Workshop on High-dimensional Learning Dynamics (HiLD)

Training Dynamics of In-Context Learning in Linear Attention

Yedi Zhang ⋅ Aaditya Singh ⋅ Peter Latham ⋅ Andrew Saxe


Abstract

While attention-based models have demonstrated a remarkable ability for in-context learning (ICL), the theoretical understanding of how these models acquire this ability through gradient descent training is still preliminary. To help close this gap, we study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We show that the training dynamics have exponentially many fixed points and that the loss exhibits saddle-to-saddle dynamics, which we reduce to scalar ordinary differential equations. During training, the model implements principal component regression in context, with the number of principal components increasing over training time. Overall, we provide a theoretical description of how ICL abilities progressively improve during the gradient descent training of multi-head linear self-attention.
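As a rough illustration of what principal component regression in context can look like, the NumPy sketch below predicts a query label from a context of (x, y) pairs using only the top k principal directions of a covariance matrix. It is not taken from the paper: the function name `pcr_in_context_predict`, the use of the empirical context covariance, and the toy data are all assumptions made for this example.

```python
import numpy as np

def pcr_in_context_predict(X, y, x_query, cov, k):
    """Hypothetical helper: predict y for x_query from context (X, y)
    using only the top-k principal components of `cov`."""
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]            # indices of the k largest eigenvalues
    top_vals, top_vecs = eigvals[order], eigvecs[:, order]
    xy_corr = X.T @ y / X.shape[0]                   # (1/N) * sum_n y_n x_n
    # Project the input-label correlation onto the top-k directions
    # and rescale each direction by 1/eigenvalue.
    w_hat = top_vecs @ ((top_vecs.T @ xy_corr) / top_vals)
    return float(x_query @ w_hat)

# Toy in-context linear regression task (all values are assumptions).
rng = np.random.default_rng(0)
D, N = 4, 64
scales = np.array([2.0, 1.5, 1.0, 0.5])              # anisotropic input scales
X = rng.normal(size=(N, D)) * scales                 # context inputs
w_true = rng.normal(size=D)                          # task vector for this context
y = X @ w_true                                       # noiseless context labels
x_query = rng.normal(size=D) * scales

emp_cov = X.T @ X / N                                # assumption: empirical covariance
preds = [pcr_in_context_predict(X, y, x_query, emp_cov, k) for k in range(1, D + 1)]
```

In this sketch, increasing `k` adds principal directions one at a time. With `k = D` and the empirical covariance, the estimate coincides with ordinary least squares on the noiseless context, so the last prediction matches `x_query @ w_true` up to numerical error.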
