Improving Transformers with Probabilistic Attention Keys
Tam Nguyen · Tan Nguyen · Dung Le · Duy Khuong Nguyen · Viet-Anh Tran · Richard Baraniuk · Nhat Ho · Stanley Osher

Thu Jul 21 01:45 PM -- 01:50 PM (PDT) @ Ballroom 1 & 2

Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, these attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain performance comparable to or better than baseline transformers with 8 heads.
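To make the idea concrete, here is a minimal NumPy sketch of attention where each score is a Gaussian mixture density of the query evaluated at several keys per position, rather than a single query-key dot product. This is an illustrative sketch only, not the authors' implementation: the function name, shapes, shared mixture weights `pi`, and fixed bandwidth `sigma` are all simplifying assumptions.

```python
import numpy as np

def mgk_attention(Q, K_mix, pi, V, sigma=1.0):
    """Sketch of attention with a mixture of Gaussian keys (illustrative).

    Q:     (n, d)     queries
    K_mix: (M, n, d)  M key vectors per position (mixture components)
    pi:    (M,)       mixture weights, assumed shared across positions
    V:     (n, d)     values
    """
    # Squared distances between every query and every key component.
    diff = Q[None, :, None, :] - K_mix[:, None, :, :]   # (M, n_q, n_k, d)
    sq_dist = (diff ** 2).sum(-1)                       # (M, n_q, n_k)

    # Unnormalized Gaussian mixture density for each (query, position) pair.
    scores = (pi[:, None, None] * np.exp(-sq_dist / (2 * sigma ** 2))).sum(0)

    # Row-normalize into attention weights, then aggregate values.
    attn = scores / scores.sum(-1, keepdims=True)       # (n_q, n_k)
    return attn @ V
```

With M = 1 this reduces to ordinary distance-based (Gaussian-kernel) attention; allowing M > 1 components per position is what lets a single head cover the roles that would otherwise require several redundant heads.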

Author Information

Tam Nguyen (FPT Software Company Limited, FPT Cau Giay building, Duy Tan street, Dich Vong Hau ward, Cau Giay district, Hanoi)
Tan Nguyen (University of California, Los Angeles)
I am currently a postdoctoral scholar in the Department of Mathematics at the University of California, Los Angeles, working with Dr. Stanley J. Osher. I obtained my Ph.D. in Machine Learning from Rice University, where I was advised by Dr. Richard G. Baraniuk. My research focuses on the intersection of Deep Learning, Probabilistic Modeling, Optimization, and ODEs/PDEs. I gave an invited talk at the Deep Learning Theory Workshop at NeurIPS 2018 and organized the 1st Workshop on Integration of Deep Neural Models and Differential Equations at ICLR 2020. I also had two awesome long internships with Amazon AI and NVIDIA Research, during which I worked with Dr. Anima Anandkumar. I am the recipient of the prestigious Computing Innovation Postdoctoral Fellowship (CIFellows) from the Computing Research Association (CRA), the NSF Graduate Research Fellowship, and the IGERT Neuroengineering Traineeship. I received my M.S. and B.S. in Electrical and Computer Engineering from Rice in May 2018 and May 2014, respectively. I also write about simple ideas and principled approaches that lead to working machine learning algorithms on my blog, Almost Convergent.

Dung Le (College of Engineering and Computer Science, VinUniversity)
Duy Khuong Nguyen (FPT Software Company Limited)
Viet-Anh Tran (Deezer)
Richard Baraniuk (OpenStax / Rice University)
Nhat Ho (University of Texas at Austin)
Stanley Osher (UCLA)
