Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 168 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely self-supervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. Code and checkpoints are available at https://github.com/bigscience-workshop/architecture-objective.
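To make the architectural distinction concrete, the sketch below (not the authors' released code; see the repository above for that) contrasts the attention masks of a causal decoder-only model and a non-causal, prefix-LM-style decoder. The function names and the `prefix_len` argument are illustrative assumptions, not names from the paper.

```python
# Minimal sketch, assuming a prefix-LM formulation of "non-causal visibility":
# the first `prefix_len` tokens (the input) are attended to bidirectionally,
# while the remaining tokens keep the usual causal (autoregressive) masking.
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Lower-triangular mask: position i may attend only to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    # Non-causal (prefix-LM) mask: full visibility over the input prefix,
    # causal visibility over everything that follows it.
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True  # every position may see the whole input prefix
    return mask

if __name__ == "__main__":
    # 0/1 views of both masks for a toy 5-token sequence with a 3-token input.
    print(causal_mask(5).astype(int))
    print(prefix_lm_mask(5, prefix_len=3).astype(int))
```

Under such a mask, input tokens attend to each other in both directions, which is what the abstract refers to as "non-causal visibility on the input", while generation of the remaining tokens stays autoregressive.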
Author Information
Thomas Wang (Hugging Face)
Adam Roberts (Google Brain)
Daniel Hesslow (LightOn)
Teven Le Scao (Hugging Face)
Hyung Won Chung (Google)
Iz Beltagy (Allen Institute for AI (AI2))
Julien Launay (École Normale Supérieure)
Colin Raffel (Google Brain)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
  Tue. Jul 19th through Wed the 20th, Room Hall E #129
More from the Same Authors
- 2021: ROPUST: Improving Robustness through Fine-tuning with Photonic Processors and Synthetic Gradients
  Alessandro Cappelli · Ruben Ohana · Julien Launay · Laurent Meunier · Iacopo Poli
- 2023 Poster: Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models
  Nikhil Kandpal · Brian Lester · Mohammed Muqeeth · Anisha Mascarenhas · Monty Evans · Vishal Baskaran · Tenghao Huang · Haokun Liu · Colin Raffel
- 2023 Poster: Large Language Models Struggle to Learn Long-Tail Knowledge
  Nikhil Kandpal · Haikang Deng · Adam Roberts · Eric Wallace · Colin Raffel
- 2023 Workshop: ES-FoMo: Efficient Systems for Foundation Models
  Julien Launay · Daniel Y Fu · Tri Dao · Daniel Hesslow · Beidi Chen · Azalia Mirhoseini · Percy Liang
- 2022 Workshop: The First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward
  Huaxiu Yao · Hugo Larochelle · Percy Liang · Colin Raffel · Jian Tang · Ying Wei · Saining Xie · Eric Xing · Chelsea Finn
- 2022: RITA: a Study on Scaling Up Generative Protein Sequence Models
  Daniel Hesslow
- 2022 Poster: Staged Training for Transformer Language Models
  Sheng Shen · Pete Walsh · Kurt Keutzer · Jesse Dodge · Matthew Peters · Iz Beltagy
- 2022 Spotlight: Staged Training for Transformer Language Models
  Sheng Shen · Pete Walsh · Kurt Keutzer · Jesse Dodge · Matthew Peters · Iz Beltagy
- 2022 Poster: Deduplicating Training Data Mitigates Privacy Risks in Language Models
  Nikhil Kandpal · Eric Wallace · Colin Raffel
- 2022 Spotlight: Deduplicating Training Data Mitigates Privacy Risks in Language Models
  Nikhil Kandpal · Eric Wallace · Colin Raffel
- 2017 Poster: Online and Linear-Time Attention by Enforcing Monotonic Alignments
  Colin Raffel · Thang Luong · Peter Liu · Ron Weiss · Douglas Eck
- 2017 Talk: Online and Linear-Time Attention by Enforcing Monotonic Alignments
  Colin Raffel · Thang Luong · Peter Liu · Ron Weiss · Douglas Eck