Poster
in
Workshop: HiLD: High-dimensional Learning Dynamics Workshop
On the Training and Generalization Dynamics of Multi-head Attention
Puneesh Deora · Rouzbeh Ghaderi · Hossein Taheri · Christos Thrampoulidis
The training and generalization dynamics of the attention mechanism, the core component of Transformers, remain under-explored. Moreover, existing analyses focus primarily on single-head attention. Inspired by the demonstrated benefits of overparameterization when training multilayer perceptrons, we explore the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model under a suitable realizability condition. Furthermore, we establish primitive conditions on the initialization that ensure this realizability condition holds. As a specific case study, we demonstrate that these conditions are satisfied for a simple yet representative tokenized mixture model. Overall, the analysis is quite general and has the potential to be extended to various data models and architecture variants.
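To make the studied setting concrete, below is a minimal sketch of the kind of model the abstract describes: a single-layer multi-head self-attention block with a linear read-out, trained by full-batch gradient descent on synthetic token sequences. This is an illustrative assumption only; the paper's exact parameterization, loss, and tokenized mixture data model may differ.

```python
import torch
import torch.nn as nn

class SingleLayerMHA(nn.Module):
    """Single-layer multi-head self-attention with a linear read-out.
    Hypothetical instantiation; the paper's exact model may differ."""
    def __init__(self, dim, num_heads, num_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.readout = nn.Linear(dim, num_classes)

    def forward(self, x):                 # x: (batch, tokens, dim)
        out, _ = self.attn(x, x, x)       # self-attention: queries = keys = values = x
        pooled = out.mean(dim=1)          # average over tokens
        return self.readout(pooled)       # class logits

# Full-batch gradient descent on synthetic data (illustrative stand-in
# for the tokenized mixture model studied in the paper).
torch.manual_seed(0)
X = torch.randn(128, 10, 16)              # 128 sequences, 10 tokens, dimension 16
y = torch.randint(0, 2, (128,))           # binary labels
model = SingleLayerMHA(dim=16, num_heads=4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```

Increasing `num_heads` while keeping the embedding dimension fixed is one simple way to probe, empirically, the multi-head overparameterization effects that the paper analyzes theoretically.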