Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference, while still achieving better overall few-shot performance across 29 NLP tasks.
Author Information
Nan Du (Google)
Yanping Huang (Google Brain)
Andrew Dai (Google)
Andrew Dai was awarded an MA in Computer Science at the University of Cambridge before receiving a PhD in Informatics at the University of Edinburgh for text modeling with Bayesian nonparametrics. He subsequently worked at Google in Mountain View, California, in a range of teams including machine translation, Google Now, and Google Ads. In 2014, he joined the Google Brain team, where he has worked on text representations, semi-supervised learning, sequence models, adversarial training, and deep learning on medical data.
Simon Tong (Google Brain)
Dmitry Lepikhin (Google)
Yuanzhong Xu (Google)
Maxim Krikun (Google)
Yanqi Zhou (Google)
Adams Wei Yu (Google Brain)
Orhan Firat (Google)
Barret Zoph (Google)
William Fedus (Google Brain)
Maarten Bosma (Google)
Zongwei Zhou (Google Inc.)
Tao Wang (Google Inc.)
Emma Wang (Google)
Kellie Webster (Google)
Marie Pellat (Google)
Kevin Robinson (Google)
Kathleen Meier-Hellstern (Google)
Kathy is a Principal Engineer and Director in Google Research, serving as the Responsible AI Tech Lead for Google’s large language and multimodal models. Her research mission is to create scalable tools, data and processes for evaluating and improving RAI in ML Models and Products. Kathy was previously a Principal Site Reliability Engineer at Google, focused on improving the end-to-end client experience in YouTube and Ads. Before joining Google, Kathy was Assistant Vice President of Optimization, Reliability & Customer Analytics (ORCA) in AT&T Labs, responsible for delivering enhanced analytic tools and software for AT&T’s Next Generation networks. Kathy is an AT&T Fellow, and holds a Ph.D. and Master’s degree in Operations Research from University of Delaware.
Toju Duke (Google)
Lucas Dixon (Google)
Kun Zhang (Google)
Quoc Le (Google Brain)
Yonghui Wu (Google)
Zhifeng Chen (Google)
Claire Cui (Google)
Related Events (a corresponding poster, oral, or spotlight)
2022 Spotlight: GLaM: Efficient Scaling of Language Models with Mixture-of-Experts »
Thu. Jul 21st 07:30 -- 07:35 PM Room Hall G
More from the Same Authors
2023 : DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining »
Sang Michael Xie · Hieu Pham · Xuanyi Dong · Nan Du · Hanxiao Liu · Yifeng Lu · Percy Liang · Quoc Le · Tengyu Ma · Adams Wei Yu -
2023 : RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting »
Liangchen Luo · Lei Shu · Jayakumar Hoskere · Yun Zhu · Canoee Liu · Simon Tong · Jindong Chen · Lei Meng -
2023 : Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction »
Jonathan Pilault · Xavier Garcia · Arthur Brazinskas · Orhan Firat -
2023 : Learning Large Graph Property Prediction via Graph Segment Training »
Kaidi Cao · Phitchaya Phothilimthana · Sami Abu-El-Haija · Dustin Zelle · Yanqi Zhou · Charith Mendis · Jure Leskovec · Bryan Perozzi -
2023 Poster: The Unreasonable Effectiveness of Few-shot Learning for Machine Translation »
Xavier Garcia · Yamini Bansal · Colin Cherry · George Foster · Maxim Krikun · Melvin Johnson · Orhan Firat -
2023 Poster: Scaling Laws for Multilingual Neural Machine Translation »
Patrick Fernandes · Behrooz Ghorbani · Xavier Garcia · Markus Freitag · Orhan Firat -
2023 Poster: Underspecification Presents Challenges for Credibility in Modern Machine Learning »
Alexander D'Amour · Katherine Heller · Dan Moldovan · Ben Adlam · Babak Alipanahi · Alex Beutel · Christina Chen · Jonathan Deaton · Jacob Eisenstein · Matthew Hoffman · Farhad Hormozdiari · Neil Houlsby · Shaobo Hou · Ghassen Jerfel · Alan Karthikesalingam · Mario Lucic · Yian Ma · Cory McLean · Diana Mincu · Akinori Mitani · Andrea Montanari · Zachary Nado · Vivek Natarajan · Christopher Nielson · Thomas F. Osborne · Rajiv Raman · Kim Ramasamy · Rory Sayres · Jessica Schrouff · Martin Seneviratne · Shannon Sequeira · Harini Suresh · Victor Veitch · Maksym Vladymyrov · Xuezhi Wang · Kellie Webster · Steve Yadlowsky · Taedong Yun · Xiaohua Zhai · D. Sculley -
2023 Poster: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning »
Shayne Longpre · Le Hou · Tu Vu · Albert Webson · Hyung Won Chung · Yi Tay · Denny Zhou · Quoc Le · Barret Zoph · Jason Wei · Adam Roberts -
2023 Poster: Lifelong Language Pretraining with Distribution-Specialized Experts »
Wuyang Chen · Yanqi Zhou · Nan Du · Yanping Huang · James Laudon · Zhifeng Chen · Claire Cui -
2023 Poster: Brainformers: Trading Simplicity for Efficiency »
Yanqi Zhou · Nan Du · Yanping Huang · Daiyi Peng · Chang Lan · Da Huang · Siamak Shakeri · David So · Andrew Dai · Yifeng Lu · Zhifeng Chen · Quoc Le · Claire Cui · James Laudon · Jeff Dean -
2022 Poster: Transformer Quality in Linear Time »
Weizhe Hua · Zihang Dai · Hanxiao Liu · Quoc Le -
2022 Poster: Self-supervised learning with random-projection quantizer for speech recognition »
Chung-Cheng Chiu · James Qin · Yu Zhang · Jiahui Yu · Yonghui Wu -
2022 Poster: Data Scaling Laws in NMT: The Effect of Noise and Architecture »
Yamini Bansal · Behrooz Ghorbani · Ankush Garg · Biao Zhang · Colin Cherry · Behnam Neyshabur · Orhan Firat -
2022 Spotlight: Self-supervised learning with random-projection quantizer for speech recognition »
Chung-Cheng Chiu · James Qin · Yu Zhang · Jiahui Yu · Yonghui Wu -
2022 Spotlight: Data Scaling Laws in NMT: The Effect of Noise and Architecture »
Yamini Bansal · Behrooz Ghorbani · Ankush Garg · Biao Zhang · Colin Cherry · Behnam Neyshabur · Orhan Firat -
2022 Spotlight: Transformer Quality in Linear Time »
Weizhe Hua · Zihang Dai · Hanxiao Liu · Quoc Le -
2022 Poster: Examining Scaling and Transfer of Language Model Architectures for Machine Translation »
Biao Zhang · Behrooz Ghorbani · Ankur Bapna · Yong Cheng · Xavier Garcia · Jonathan Shen · Orhan Firat -
2022 Spotlight: Examining Scaling and Transfer of Language Model Architectures for Machine Translation »
Biao Zhang · Behrooz Ghorbani · Ankur Bapna · Yong Cheng · Xavier Garcia · Jonathan Shen · Orhan Firat -
2021 Poster: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision »
Chao Jia · Yinfei Yang · Ye Xia · Yi-Ting Chen · Zarana Parekh · Hieu Pham · Quoc Le · Yun-Hsuan Sung · Zhen Li · Tom Duerig -
2021 Oral: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision »
Chao Jia · Yinfei Yang · Ye Xia · Yi-Ting Chen · Zarana Parekh · Hieu Pham · Quoc Le · Yun-Hsuan Sung · Zhen Li · Tom Duerig -
2021 Poster: EfficientNetV2: Smaller Models and Faster Training »
Mingxing Tan · Quoc Le -
2021 Poster: Towards Domain-Agnostic Contrastive Learning »
Vikas Verma · Thang Luong · Kenji Kawaguchi · Hieu Pham · Quoc Le -
2021 Spotlight: EfficientNetV2: Smaller Models and Faster Training »
Mingxing Tan · Quoc Le -
2021 Spotlight: Towards Domain-Agnostic Contrastive Learning »
Vikas Verma · Thang Luong · Kenji Kawaguchi · Hieu Pham · Quoc Le -
2020 Poster: Go Wide, Then Narrow: Efficient Training of Deep Thin Networks »
Denny Zhou · Mao Ye · Chen Chen · Tianjian Meng · Mingxing Tan · Xiaodan Song · Quoc Le · Qiang Liu · Dale Schuurmans -
2020 Poster: XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation »
Junjie Hu · Sebastian Ruder · Aditya Siddhant · Graham Neubig · Orhan Firat · Melvin Johnson -
2020 Poster: AutoML-Zero: Evolving Machine Learning Algorithms From Scratch »
Esteban Real · Chen Liang · David So · Quoc Le -
2019 Poster: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks »
Mingxing Tan · Quoc Le -
2019 Poster: The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study »
Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith -
2019 Poster: The Evolved Transformer »
David So · Quoc Le · Chen Liang -
2019 Oral: The Evolved Transformer »
David So · Quoc Le · Chen Liang -
2019 Oral: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks »
Mingxing Tan · Quoc Le -
2019 Oral: The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study »
Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith -
2018 Poster: Understanding and Simplifying One-Shot Architecture Search »
Gabriel Bender · Pieter-Jan Kindermans · Barret Zoph · Vijay Vasudevan · Quoc Le -
2018 Poster: Learning Longer-term Dependencies in RNNs with Auxiliary Losses »
Trieu H Trinh · Andrew Dai · Thang Luong · Quoc Le -
2018 Oral: Learning Longer-term Dependencies in RNNs with Auxiliary Losses »
Trieu H Trinh · Andrew Dai · Thang Luong · Quoc Le -
2018 Oral: Understanding and Simplifying One-Shot Architecture Search »
Gabriel Bender · Pieter-Jan Kindermans · Barret Zoph · Vijay Vasudevan · Quoc Le -
2018 Poster: Can Deep Reinforcement Learning Solve Erdos-Selfridge-Spencer Games? »
Maithra Raghu · Alexander Irpan · Jacob Andreas · Bobby Kleinberg · Quoc Le · Jon Kleinberg -
2018 Oral: Can Deep Reinforcement Learning Solve Erdos-Selfridge-Spencer Games? »
Maithra Raghu · Alexander Irpan · Jacob Andreas · Bobby Kleinberg · Quoc Le · Jon Kleinberg -
2018 Poster: Efficient Neural Architecture Search via Parameters Sharing »
Hieu Pham · Melody Guan · Barret Zoph · Quoc Le · Jeff Dean -
2018 Oral: Efficient Neural Architecture Search via Parameters Sharing »
Hieu Pham · Melody Guan · Barret Zoph · Quoc Le · Jeff Dean -
2017 Poster: Large-Scale Evolution of Image Classifiers »
Esteban Real · Sherry Moore · Andrew Selle · Saurabh Saxena · Yutaka Leon Suematsu · Jie Tan · Quoc Le · Alexey Kurakin -
2017 Poster: Neural Optimizer Search using Reinforcement Learning »
Irwan Bello · Barret Zoph · Vijay Vasudevan · Quoc Le -
2017 Poster: Device Placement Optimization with Reinforcement Learning »
Azalia Mirhoseini · Hieu Pham · Quoc Le · Benoit Steiner · Mohammad Norouzi · Rasmus Larsen · Yuefeng Zhou · Naveen Kumar · Samy Bengio · Jeff Dean -
2017 Talk: Neural Optimizer Search using Reinforcement Learning »
Irwan Bello · Barret Zoph · Vijay Vasudevan · Quoc Le -
2017 Talk: Large-Scale Evolution of Image Classifiers »
Esteban Real · Sherry Moore · Andrew Selle · Saurabh Saxena · Yutaka Leon Suematsu · Jie Tan · Quoc Le · Alexey Kurakin -
2017 Talk: Device Placement Optimization with Reinforcement Learning »
Azalia Mirhoseini · Hieu Pham · Quoc Le · Benoit Steiner · Mohammad Norouzi · Rasmus Larsen · Yuefeng Zhou · Naveen Kumar · Samy Bengio · Jeff Dean