What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization?

Thomas Wang · Adam Roberts · Daniel Hesslow · Teven Le Scao · Hyung Won Chung · Iz Beltagy · Julien Launay · Colin Raffel

Hall E #129

Keywords: [ MISC: Transfer, Multitask and Meta-learning ] [ DL: Attention Mechanisms ] [ APP: Language, Speech and Dialog ]


Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 168 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely self-supervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. Code and checkpoints are available at workshop/architecture-objective.

Chat is not available.