Poster
What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization?
Colin Raffel · Adam Roberts · Hyung Won Chung · Iz Beltagy · Daniel Hesslow · Julien Launay · Thomas Wang · Teven Le Scao

Tue Jul 19 03:30 PM -- 05:30 PM (PDT) @ Hall E #129

Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e., they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with multiple model architectures (causal/non-causal decoder-only and encoder-decoder) trained with different pretraining objectives (autoregressive and masked language modeling) and evaluated with and without intermediate multitask prompted training. Our models are significantly larger than those considered in past studies (with 5+ billion parameters trained for 168 billion tokens), thereby increasing the chance that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization among those with purely unsupervised pretraining. However, models with non-causal visibility on their input that have been trained with a masked language modeling objective followed by multitask adaptation perform the best among our experiments. We therefore consider the application of autoregressive language modeling as a downstream task, which allows a pretrained non-causal decoder model to be efficiently adapted into a performant autoregressive causal decoder model. To facilitate future work on large language models, we release all models, datasets, and code used in this study.
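To make the architectural distinction above concrete, here is a minimal sketch (not code from the paper; the function names, NumPy usage, and illustrative sizes are our own) of the attention masks that separate a causal decoder from a non-causal, prefix-style decoder: in the latter, tokens belonging to the input prefix attend to one another bidirectionally, while the rest of the sequence remains autoregressive.

import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Each position may attend only to itself and earlier positions.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def non_causal_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    # Prefix (input) positions attend to each other bidirectionally;
    # positions after the prefix remain strictly causal.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

# Illustration: a 6-token sequence whose first 3 tokens form the input prefix.
print(causal_mask(6).astype(int))
print(non_causal_mask(6, prefix_len=3).astype(int))

An encoder-decoder realizes the same non-causal visibility over the input through a separate encoder stack rather than through a modified decoder mask.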

Author Information

Colin Raffel (Google Brain)
Adam Roberts (Google Brain)
Hyung Won Chung (Google)
Iz Beltagy (Allen Institute for AI (AI2))
Daniel Hesslow (LightOn)
Julien Launay (École Normale Supérieure)
Thomas Wang (Hugging Face)
Teven Le Scao (Hugging Face)
