
On the Training and Generalization Dynamics of Multi-head Attention
Puneesh Deora · Rouzbeh Ghaderi · Hossein Taheri · Christos Thrampoulidis

The training and generalization dynamics of the Transformer's core component, the attention mechanism, remain under-explored. Moreover, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training multilayer perceptrons, we explore the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition. Furthermore, we establish primitive conditions on the initialization that ensure the realizability condition holds. As a specific case study, we demonstrate that these conditions are satisfied for a simple yet representative tokenized mixture model. Overall, the analysis presented is quite general and has the potential to be extended to various data models and architecture variations.
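The architecture studied is a single-layer multi-head self-attention model. As a point of reference, here is a minimal NumPy sketch of the forward pass of such a model; the dimensions, random initialization, and function names are illustrative and not taken from the paper's setup:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Single-layer multi-head self-attention forward pass.

    X:          (T, d)      sequence of T tokens, each of dimension d
    Wq, Wk, Wv: (H, d, dh)  per-head query/key/value projections
    Wo:         (H*dh, d)   output projection combining all H heads
    Returns:    (T, d)
    """
    H, d, dh = Wq.shape
    heads = []
    for h in range(H):
        Q = X @ Wq[h]                        # (T, dh) queries
        K = X @ Wk[h]                        # (T, dh) keys
        V = X @ Wv[h]                        # (T, dh) values
        A = softmax(Q @ K.T / np.sqrt(dh))   # (T, T) attention weights
        heads.append(A @ V)                  # (T, dh) per-head output
    # Concatenate heads and project back to model dimension.
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
T, d, H, dh = 5, 8, 4, 2                     # illustrative sizes
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((H, d, dh)) for _ in range(3))
Wo = rng.standard_normal((H * dh, d))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)  # (5, 8)
```

In the paper's setting, all of these projection matrices would be trained jointly by gradient descent; the sketch only shows the model being analyzed, not the training procedure or the realizability condition.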

Author Information

Puneesh Deora (Indian Statistical Institute)

I am a Visiting Researcher in the CVPR unit, Indian Statistical Institute Kolkata. I completed my Bachelor's in ECE from the Indian Institute of Technology Roorkee in 2020. My areas of interest broadly lie in machine learning theory, optimization, and adversarial robustness. I am currently working on improving deep metric learning approaches via optimal negative vectors in the embedding space. In the past, I have worked on generative modeling for compressive sensing image reconstruction, zero-shot learning for action recognition in videos, and signal processing for fetal heart rate monitoring.

Rouzbeh Ghaderi (University of British Columbia)
Hossein Taheri (UC Santa Barbara)
Christos Thrampoulidis (University of British Columbia)
