Timezone: »
Spotlight
Transformer Quality in Linear Time
Weizhe Hua · Zihang Dai · Hanxiao Liu · Quoc Le
We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9x on Wiki-40B and 12.1x on PG-19 for auto-regressive language modeling, and 4.8x on C4 for masked language modeling.
Author Information
Weizhe Hua (Cornell University)
Zihang Dai
Hanxiao Liu (Google Brain)
Quoc Le (Google Brain)
Related Events (a corresponding poster, oral, or spotlight)
-
2022 Poster: Transformer Quality in Linear Time »
Thu. Jul 21st through Fri the 22nd Room Hall E #420
More from the Same Authors
-
2023 : DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining »
Sang Michael Xie · Hieu Pham · Xuanyi Dong · Nan Du · Hanxiao Liu · Yifeng Lu · Percy Liang · Quoc Le · Tengyu Ma · Adams Wei Yu -
2023 Poster: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning »
Shayne Longpre · Le Hou · Tu Vu · Albert Webson · Hyung Won Chung · Yi Tay · Denny Zhou · Quoc Le · Barret Zoph · Jason Wei · Adam Roberts -
2023 Poster: Brainformers: Trading Simplicity for Efficiency »
Yanqi Zhou · Nan Du · Yanping Huang · Daiyi Peng · Chang Lan · Da Huang · Siamak Shakeri · David So · Andrew Dai · Yifeng Lu · Zhifeng Chen · Quoc Le · Claire Cui · James Laudon · Jeff Dean -
2022 Poster: GLaM: Efficient Scaling of Language Models with Mixture-of-Experts »
Nan Du · Yanping Huang · Andrew Dai · Simon Tong · Dmitry Lepikhin · Yuanzhong Xu · Maxim Krikun · Yanqi Zhou · Adams Wei Yu · Orhan Firat · Barret Zoph · William Fedus · Maarten Bosma · Zongwei Zhou · Tao Wang · Emma Wang · Kellie Webster · Marie Pellat · Kevin Robinson · Kathleen Meier-Hellstern · Toju Duke · Lucas Dixon · Kun Zhang · Quoc Le · Yonghui Wu · Zhifeng Chen · Claire Cui -
2022 Spotlight: GLaM: Efficient Scaling of Language Models with Mixture-of-Experts »
Nan Du · Yanping Huang · Andrew Dai · Simon Tong · Dmitry Lepikhin · Yuanzhong Xu · Maxim Krikun · Yanqi Zhou · Adams Wei Yu · Orhan Firat · Barret Zoph · William Fedus · Maarten Bosma · Zongwei Zhou · Tao Wang · Emma Wang · Kellie Webster · Marie Pellat · Kevin Robinson · Kathleen Meier-Hellstern · Toju Duke · Lucas Dixon · Kun Zhang · Quoc Le · Yonghui Wu · Zhifeng Chen · Claire Cui -
2021 Poster: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision »
Chao Jia · Yinfei Yang · Ye Xia · Yi-Ting Chen · Zarana Parekh · Hieu Pham · Quoc Le · Yun-Hsuan Sung · Zhen Li · Tom Duerig -
2021 Oral: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision »
Chao Jia · Yinfei Yang · Ye Xia · Yi-Ting Chen · Zarana Parekh · Hieu Pham · Quoc Le · Yun-Hsuan Sung · Zhen Li · Tom Duerig -
2021 Poster: EfficientNetV2: Smaller Models and Faster Training »
Mingxing Tan · Quoc Le -
2021 Poster: Towards Domain-Agnostic Contrastive Learning »
Vikas Verma · Thang Luong · Kenji Kawaguchi · Hieu Pham · Quoc Le -
2021 Spotlight: EfficientNetV2: Smaller Models and Faster Training »
Mingxing Tan · Quoc Le -
2021 Spotlight: Towards Domain-Agnostic Contrastive Learning »
Vikas Verma · Thang Luong · Kenji Kawaguchi · Hieu Pham · Quoc Le -
2020 Poster: Go Wide, Then Narrow: Efficient Training of Deep Thin Networks »
Denny Zhou · Mao Ye · Chen Chen · Tianjian Meng · Mingxing Tan · Xiaodan Song · Quoc Le · Qiang Liu · Dale Schuurmans -
2020 Poster: AutoML-Zero: Evolving Machine Learning Algorithms From Scratch »
Esteban Real · Chen Liang · David So · Quoc Le -
2019 Poster: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks »
Mingxing Tan · Quoc Le -
2019 Poster: The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study »
Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith -
2019 Poster: The Evolved Transformer »
David So · Quoc Le · Chen Liang -
2019 Oral: The Evolved Transformer »
David So · Quoc Le · Chen Liang -
2019 Oral: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks »
Mingxing Tan · Quoc Le -
2019 Oral: The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study »
Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith -
2018 Poster: Understanding and Simplifying One-Shot Architecture Search »
Gabriel Bender · Pieter-Jan Kindermans · Barret Zoph · Vijay Vasudevan · Quoc Le -
2018 Poster: Learning Longer-term Dependencies in RNNs with Auxiliary Losses »
Trieu H Trinh · Andrew Dai · Thang Luong · Quoc Le -
2018 Oral: Learning Longer-term Dependencies in RNNs with Auxiliary Losses »
Trieu H Trinh · Andrew Dai · Thang Luong · Quoc Le -
2018 Oral: Understanding and Simplifying One-Shot Architecture Search »
Gabriel Bender · Pieter-Jan Kindermans · Barret Zoph · Vijay Vasudevan · Quoc Le -
2018 Poster: Can Deep Reinforcement Learning Solve Erdos-Selfridge-Spencer Games? »
Maithra Raghu · Alexander Irpan · Jacob Andreas · Bobby Kleinberg · Quoc Le · Jon Kleinberg -
2018 Oral: Can Deep Reinforcement Learning Solve Erdos-Selfridge-Spencer Games? »
Maithra Raghu · Alexander Irpan · Jacob Andreas · Bobby Kleinberg · Quoc Le · Jon Kleinberg -
2018 Poster: Efficient Neural Architecture Search via Parameters Sharing »
Hieu Pham · Melody Guan · Barret Zoph · Quoc Le · Jeff Dean -
2018 Oral: Efficient Neural Architecture Search via Parameters Sharing »
Hieu Pham · Melody Guan · Barret Zoph · Quoc Le · Jeff Dean -
2017 Poster: Large-Scale Evolution of Image Classifiers »
Esteban Real · Sherry Moore · Andrew Selle · Saurabh Saxena · Yutaka Leon Suematsu · Jie Tan · Quoc Le · Alexey Kurakin -
2017 Poster: Neural Optimizer Search using Reinforcement Learning »
Irwan Bello · Barret Zoph · Vijay Vasudevan · Quoc Le -
2017 Poster: Device Placement Optimization with Reinforcement Learning »
Azalia Mirhoseini · Hieu Pham · Quoc Le · benoit steiner · Mohammad Norouzi · Rasmus Larsen · Yuefeng Zhou · Naveen Kumar · Samy Bengio · Jeff Dean -
2017 Talk: Neural Optimizer Search using Reinforcement Learning »
Irwan Bello · Barret Zoph · Vijay Vasudevan · Quoc Le -
2017 Talk: Large-Scale Evolution of Image Classifiers »
Esteban Real · Sherry Moore · Andrew Selle · Saurabh Saxena · Yutaka Leon Suematsu · Jie Tan · Quoc Le · Alexey Kurakin -
2017 Talk: Device Placement Optimization with Reinforcement Learning »
Azalia Mirhoseini · Hieu Pham · Quoc Le · benoit steiner · Mohammad Norouzi · Rasmus Larsen · Yuefeng Zhou · Naveen Kumar · Samy Bengio · Jeff Dean