Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is, counterintuitively, to train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
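To make the compression step concrete, the sketch below applies magnitude pruning and post-training dynamic quantization to a pretrained Transformer using standard PyTorch utilities. This is a minimal illustration rather than the paper's exact procedure; the model name ("roberta-large"), the 30% sparsity level, and the use of Hugging Face `transformers` are illustrative assumptions.

```python
# Sketch: compress a large pretrained Transformer after a short training run.
# Assumes PyTorch and the Hugging Face `transformers` library; the model name
# and sparsity level are illustrative, not the paper's exact configuration.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-large")

# Magnitude pruning: zero out the 30% smallest-magnitude weights in every linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Post-training dynamic quantization: int8 weights for all linear layers.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

The paper's central observation is that a large model compressed this way can retain higher accuracy than a small model trained under the same compute budget and compressed only lightly.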
Author Information
Zhuohan Li (UC Berkeley)
Eric Wallace (UC Berkeley)
Sheng Shen (UC Berkeley)
Kevin Lin (UC Berkeley)
Kurt Keutzer (UC Berkeley)
Dan Klein (UC Berkeley)
Joseph Gonzalez (UC Berkeley)
More from the Same Authors
- 2021: Learning Space Partitions for Path Planning
  Kevin Yang · Tianjun Zhang · Chris Cummins · Brandon Cui · Benoit Steiner · Linnan Wang · Joseph E Gonzalez · Dan Klein · Yuandong Tian
- 2023 Poster: Poisoning Language Models During Instruction Tuning
  Alexander Wan · Eric Wallace · Sheng Shen · Dan Klein
- 2023 Poster: Large Language Models Struggle to Learn Long-Tail Knowledge
  Nikhil Kandpal · Haikang Deng · Adam Roberts · Eric Wallace · Colin Raffel
- 2023 Poster: High-throughput Generative Inference of Large Language Models with a Single GPU
  Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Ce Zhang · Ion Stoica · Christopher Re
- 2023 Oral: High-throughput Generative Inference of Large Language Models with a Single GPU
  Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Ce Zhang · Ion Stoica · Christopher Re
- 2022 Poster: GACT: Activation Compressed Training for Generic Network Architectures
  Xiaoxuan Liu · Lianmin Zheng · Dequan Wang · Yukuo Cen · Weize Chen · Xu Han · Jianfei Chen · Zhiyuan Liu · Jie Tang · Joseph Gonzalez · Michael Mahoney · Alvin Cheung
- 2022 Poster: Staged Training for Transformer Language Models
  Sheng Shen · Pete Walsh · Kurt Keutzer · Jesse Dodge · Matthew Peters · Iz Beltagy
- 2022 Spotlight: Staged Training for Transformer Language Models
  Sheng Shen · Pete Walsh · Kurt Keutzer · Jesse Dodge · Matthew Peters · Iz Beltagy
- 2022 Spotlight: GACT: Activation Compressed Training for Generic Network Architectures
  Xiaoxuan Liu · Lianmin Zheng · Dequan Wang · Yukuo Cen · Weize Chen · Xu Han · Jianfei Chen · Zhiyuan Liu · Jie Tang · Joseph Gonzalez · Michael Mahoney · Alvin Cheung
- 2022 Poster: Deduplicating Training Data Mitigates Privacy Risks in Language Models
  Nikhil Kandpal · Eric Wallace · Colin Raffel
- 2022 Spotlight: Deduplicating Training Data Mitigates Privacy Risks in Language Models
  Nikhil Kandpal · Eric Wallace · Colin Raffel
- 2022 Poster: Describing Differences between Text Distributions with Natural Language
  Ruiqi Zhong · Charlie Snell · Dan Klein · Jacob Steinhardt
- 2022 Spotlight: Describing Differences between Text Distributions with Natural Language
  Ruiqi Zhong · Charlie Snell · Dan Klein · Jacob Steinhardt
- 2022: Inter-Operator Parallelism
  Zhuohan Li
- 2022 Tutorial: Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models
  Hao Zhang · Lianmin Zheng · Zhuohan Li · Ion Stoica
- 2021 Poster: Calibrate Before Use: Improving Few-shot Performance of Language Models
  Tony Z. Zhao · Eric Wallace · Shi Feng · Dan Klein · Sameer Singh
- 2021 Oral: Calibrate Before Use: Improving Few-shot Performance of Language Models
  Tony Z. Zhao · Eric Wallace · Shi Feng · Dan Klein · Sameer Singh
- 2021 Poster: TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
  Zhuohan Li · Siyuan Zhuang · Shiyuan Guo · Danyang Zhuo · Hao Zhang · Dawn Song · Ion Stoica
- 2021 Spotlight: TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
  Zhuohan Li · Siyuan Zhuang · Shiyuan Guo · Danyang Zhuo · Hao Zhang · Dawn Song · Ion Stoica
- 2020 Poster: PowerNorm: Rethinking Batch Normalization in Transformers
  Sheng Shen · Zhewei Yao · Amir Gholaminejad · Michael Mahoney · Kurt Keutzer
- 2017 Poster: Modular Multitask Reinforcement Learning with Policy Sketches
  Jacob Andreas · Dan Klein · Sergey Levine
- 2017 Talk: Modular Multitask Reinforcement Learning with Policy Sketches
  Jacob Andreas · Dan Klein · Sergey Levine