Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, these attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus efficiently on different parts of the input sequence. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended for use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the WikiText-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain performance comparable to or better than baseline transformers with 8 heads.
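To make the mechanism in the abstract concrete, the snippet below is a minimal, illustrative PyTorch-style sketch of attention whose keys at each position form a mixture (here, two components per head), assuming standard softmax attention and learnable, position-shared mixture weights; the class and parameter names (MGKAttentionSketch, num_keys, pi) are ours for illustration and are not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGKAttentionSketch(nn.Module):
    """Illustrative multi-head attention with a mixture of keys per head."""

    def __init__(self, dim, num_heads=4, num_keys=2):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.m, self.d = num_heads, num_keys, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim * num_keys)  # one key set per mixture component
        self.v_proj = nn.Linear(dim, dim)
        self.pi = nn.Parameter(torch.zeros(num_heads, num_keys))  # mixture logits (assumed shared across positions)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, _ = x.shape
        q = self.q_proj(x).view(B, N, self.h, self.d).transpose(1, 2)                # (B, H, N, d)
        k = self.k_proj(x).view(B, N, self.h, self.m, self.d).permute(0, 2, 3, 1, 4)  # (B, H, M, N, d)
        v = self.v_proj(x).view(B, N, self.h, self.d).transpose(1, 2)                # (B, H, N, d)
        # Per-component attention, then mix with softmax(pi):
        # A_ij ~ sum_r pi_r * softmax_j(q_i . k_jr / sqrt(d)),
        # the softmax-kernel analogue of a Gaussian mixture over keys.
        logits = torch.einsum('bhnd,bhmjd->bhmnj', q, k) / self.d ** 0.5             # (B, H, M, N, N)
        mix = F.softmax(self.pi, dim=-1).view(1, self.h, self.m, 1, 1)
        attn = (mix * F.softmax(logits, dim=-1)).sum(dim=2)                          # (B, H, N, N)
        out = torch.einsum('bhnj,bhjd->bhnd', attn, v)                               # (B, H, N, d)
        return self.out_proj(out.transpose(1, 2).reshape(B, N, -1))


# Quick shape check: a 4-head layer with 2 keys per head on a toy batch.
if __name__ == "__main__":
    layer = MGKAttentionSketch(dim=64, num_heads=4, num_keys=2)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])
```

Because the mixture over keys lets a single head cover several attention patterns, a model built from such heads can, as the abstract reports, match a conventional transformer while using roughly half as many heads.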
Author Information
Tam Nguyen (FPT Software Company Limited, Hanoi)
Tan Nguyen (University of California, Los Angeles)

I am currently a postdoctoral scholar in the Department of Mathematics at the University of California, Los Angeles, working with Dr. Stanley J. Osher. I obtained my Ph.D. in Machine Learning from Rice University, where I was advised by Dr. Richard G. Baraniuk. My research focuses on the intersection of Deep Learning, Probabilistic Modeling, Optimization, and ODEs/PDEs. I gave an invited talk at the Deep Learning Theory Workshop at NeurIPS 2018 and organized the 1st Workshop on Integration of Deep Neural Models and Differential Equations at ICLR 2020. I also had two long internships with Amazon AI and NVIDIA Research, during which I worked with Dr. Anima Anandkumar. I am the recipient of the Computing Innovation Postdoctoral Fellowship (CIFellows) from the Computing Research Association (CRA), the NSF Graduate Research Fellowship, and the IGERT Neuroengineering Traineeship. I received my M.S. and B.S. in Electrical and Computer Engineering from Rice in May 2018 and May 2014, respectively. I also write about simple ideas and principled approaches that lead to working machine learning algorithms on my blog, Almost Convergent.
Dung Le (College of Engineering and Computer Science, VinUniversity)
Duy Khuong Nguyen (FPT Software Company Limited)
Viet-Anh Tran (Deezer)
Richard Baraniuk (OpenStax / Rice University)
Nhat Ho (University of Texas at Austin)
Stanley Osher (UCLA)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Spotlight: Improving Transformers with Probabilistic Attention Keys
  Thu. Jul 21st, 08:45 -- 08:50 PM, Ballroom 1 & 2
More from the Same Authors
- 2023: Provable Instance Specific Robustness via Linear Constraints
  Ahmed Imtiaz Humayun · Josue Casco-Rodriguez · Randall Balestriero · Richard Baraniuk
- 2023: Fast Approximation of the Generalized Sliced-Wasserstein Distance
  Dung Le · Huy Nguyen · Khai Nguyen · Nhat Ho
- 2023 Poster: Revisiting Over-smoothing and Over-squashing Using Ollivier-Ricci Curvature
  Khang Nguyen · Nong Hieu · Vinh NGUYEN · Nhat Ho · Stanley Osher · TAN NGUYEN
- 2023 Poster: On Excess Mass Behavior in Gaussian Mixture Models with Orlicz-Wasserstein Distances
  Aritra Guha · Nhat Ho · XuanLong Nguyen
- 2023 Poster: Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction
  Khai Nguyen · Dang Nguyen · Nhat Ho
- 2023 Poster: Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data
  Hien Dang · Tho Tran Huu · Stanley Osher · Hung Tran-The · Nhat Ho · TAN NGUYEN
- 2022 Poster: Entropic Gromov-Wasserstein between Gaussian Distributions
  Khang Le · Dung Le · Huy Nguyen · · Tung Pham · Nhat Ho
- 2022 Spotlight: Entropic Gromov-Wasserstein between Gaussian Distributions
  Khang Le · Dung Le · Huy Nguyen · · Tung Pham · Nhat Ho
- 2022 Poster: On Transportation of Mini-batches: A Hierarchical Approach
  Khai Nguyen · Dang Nguyen · Quoc Nguyen · Tung Pham · Hung Bui · Dinh Phung · Trung Le · Nhat Ho
- 2022 Poster: Architecture Agnostic Federated Learning for Neural Networks
  Disha Makhija · Xing Han · Nhat Ho · Joydeep Ghosh
- 2022 Poster: Improving Mini-batch Optimal Transport via Partial Transportation
  Khai Nguyen · Dang Nguyen · The-Anh Vu-Le · Tung Pham · Nhat Ho
- 2022 Spotlight: Architecture Agnostic Federated Learning for Neural Networks
  Disha Makhija · Xing Han · Nhat Ho · Joydeep Ghosh
- 2022 Spotlight: Improving Mini-batch Optimal Transport via Partial Transportation
  Khai Nguyen · Dang Nguyen · The-Anh Vu-Le · Tung Pham · Nhat Ho
- 2022 Spotlight: On Transportation of Mini-batches: A Hierarchical Approach
  Khai Nguyen · Dang Nguyen · Quoc Nguyen · Tung Pham · Hung Bui · Dinh Phung · Trung Le · Nhat Ho
- 2022 Poster: Refined Convergence Rates for Maximum Likelihood Estimation under Finite Mixture Models
  Tudor Manole · Nhat Ho
- 2022 Oral: Refined Convergence Rates for Maximum Likelihood Estimation under Finite Mixture Models
  Tudor Manole · Nhat Ho
- 2021 Poster: LAMDA: Label Matching Deep Domain Adaptation
  Trung Le · Tuan Nguyen · Nhat Ho · Hung Bui · Dinh Phung
- 2021 Spotlight: LAMDA: Label Matching Deep Domain Adaptation
  Trung Le · Tuan Nguyen · Nhat Ho · Hung Bui · Dinh Phung
- 2020 Poster: Subspace Fitting Meets Regression: The Effects of Supervision and Orthonormality Constraints on Double Descent of Generalization Errors
  Yehuda Dar · Paul Mayer · Lorenzo Luzi · Richard Baraniuk
- 2020 Poster: Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data
  Benjamin Coleman · Richard Baraniuk · Anshumali Shrivastava
- 2018 Poster: Ultra Large-Scale Feature Selection using Count-Sketches
  Amirali Aghazadeh · Ryan Spring · Daniel LeJeune · Gautam Dasarathy · Anshumali Shrivastava · Richard Baraniuk
- 2018 Poster: A Spline Theory of Deep Learning
  Randall Balestriero · Richard Baraniuk
- 2018 Poster: prDeep: Robust Phase Retrieval with a Flexible Deep Network
  Christopher Metzler · Phillip Schniter · Ashok Veeraraghavan · Richard Baraniuk
- 2018 Oral: prDeep: Robust Phase Retrieval with a Flexible Deep Network
  Christopher Metzler · Phillip Schniter · Ashok Veeraraghavan · Richard Baraniuk
- 2018 Oral: Ultra Large-Scale Feature Selection using Count-Sketches
  Amirali Aghazadeh · Ryan Spring · Daniel LeJeune · Gautam Dasarathy · Anshumali Shrivastava · Richard Baraniuk
- 2018 Oral: A Spline Theory of Deep Learning
  Randall Balestriero · Richard Baraniuk
- 2018 Poster: Spline Filters For End-to-End Deep Learning
  Randall Balestriero · Romain Cosentino · Herve Glotin · Richard Baraniuk
- 2018 Oral: Spline Filters For End-to-End Deep Learning
  Randall Balestriero · Romain Cosentino · Herve Glotin · Richard Baraniuk