The training and generalization dynamics of the Transformer's core mechanism, attention, remain under-explored. Moreover, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of learning with overparameterization when training multilayer perceptrons, we explore the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition. Furthermore, we establish primitive conditions on the initialization that ensure the realizability condition holds. As a specific case study, we demonstrate that these conditions are satisfied for a simple yet representative tokenized mixture model. Overall, the analysis presented is quite general and has the potential to be extended to various data models and architecture variations.
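To make the object of study concrete, below is a minimal, hypothetical sketch of a single-layer multi-head self-attention model with a linear readout, trained by full-batch gradient descent on a logistic loss. It is not the paper's exact construction or data model: the dimensions, mean-pooling step, synthetic ±1-labeled data, loss, and learning rate are all illustrative assumptions.

```python
# Illustrative sketch only: a single-layer multi-head self-attention classifier
# trained with plain (full-batch) gradient descent. All hyperparameters and the
# synthetic data below are assumptions, not the paper's setup.
import torch

torch.manual_seed(0)
T, d, H = 8, 16, 4                              # tokens per sequence, embedding dim, heads
n = 64                                          # number of training sequences
X = torch.randn(n, T, d)                        # synthetic token sequences
y = torch.randint(0, 2, (n,)).float() * 2 - 1   # synthetic +/-1 labels

attn = torch.nn.MultiheadAttention(embed_dim=d, num_heads=H, batch_first=True)
u = torch.nn.Parameter(torch.randn(d) / d ** 0.5)   # linear readout after mean-pooling

params = list(attn.parameters()) + [u]
lr = 0.1
for step in range(200):
    out, _ = attn(X, X, X)                      # self-attention: queries = keys = values = X
    logits = out.mean(dim=1) @ u                # mean-pool tokens, then linear head
    loss = torch.nn.functional.softplus(-y * logits).mean()   # logistic loss
    for p in params:                            # reset gradients from the previous step
        p.grad = None
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad                    # full-batch gradient-descent update
print(f"final training loss: {loss.item():.4f}")
```

The sketch is only meant to fix notation: one attention layer with H heads, a fixed readout structure, and vanilla gradient descent on all attention parameters, which is the kind of training dynamic the guarantees in the paper concern.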
Author Information
Puneesh Deora (Indian Statistical Institute)
I am a Visiting Researcher in the CVPR unit, Indian Statistical Institute Kolkata. I completed my Bachelor's in ECE from the Indian Institute of Technology Roorkee in 2020. My areas of interest broadly lie in machine learning theory, optimization, and adversarial robustness. I am currently working on improving deep metric learning approaches via optimal negative vectors in the embedding space. In the past, I have worked on generative modeling for compressive sensing image reconstruction, zero-shot learning for action recognition in videos, and signal processing for fetal heart rate monitoring.
Rouzbeh Ghaderi (University of British Columbia)
Hossein Taheri (UC Santa Barbara)
Christos Thrampoulidis (University of British Columbia)
More from the Same Authors
- 2021 : Benign Overfitting in Multiclass Classification: All Roads Lead to Interpolation
  Ke Wang · Vidya Muthukumar · Christos Thrampoulidis
- 2021 : Binary Classification of Gaussian Mixtures: Abundance of Support Vectors, Benign Overfitting and Regularization
  Ke Wang · Christos Thrampoulidis
- 2021 : Label-Imbalanced and Group-Sensitive Classification under Overparameterization
  Ganesh Ramachandra Kini · Orestis Paraskevas · Samet Oymak · Christos Thrampoulidis
- 2023 : Generalization and Stability of Interpolating Neural Networks with Minimal Width
  Hossein Taheri · Christos Thrampoulidis
- 2023 : Supervised-Contrastive Loss Learns Orthogonal Frames and Batching Matters
  Ganesh Ramachandra Kini · Vala Vakilian · Tina Behnia · Jaidev Gill · Christos Thrampoulidis
- 2023 : Fast Test Error Rates for Gradient-based Algorithms on Separable Data
  Puneesh Deora · Bhavya Vasudeva · Vatsal Sharan · Christos Thrampoulidis
- 2023 Poster: On the Role of Attention in Prompt-tuning
  Samet Oymak · Ankit Singh Rawat · Mahdi Soltanolkotabi · Christos Thrampoulidis
- 2022 Poster: FedNest: Federated Bilevel, Minimax, and Compositional Optimization
  Davoud Ataee Tarzanagh · Mingchen Li · Christos Thrampoulidis · Samet Oymak
- 2022 Oral: FedNest: Federated Bilevel, Minimax, and Compositional Optimization
  Davoud Ataee Tarzanagh · Mingchen Li · Christos Thrampoulidis · Samet Oymak
- 2021 Poster: Safe Reinforcement Learning with Linear Function Approximation
  Sanae Amani · Christos Thrampoulidis · Lin Yang
- 2021 Spotlight: Safe Reinforcement Learning with Linear Function Approximation
  Sanae Amani · Christos Thrampoulidis · Lin Yang
- 2020 Poster: Quantized Decentralized Stochastic Learning over Directed Graphs
  Hossein Taheri · Aryan Mokhtari · Hamed Hassani · Ramtin Pedarsani