Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5 to 100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also report four empirical phenomena observed during training, including emergent coordinate-ascent-style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding uni-modal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models with unique distributional properties.
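As a rough sketch of the additive formulation described above (illustrative only; the symbols $E$, $A$, $B$, $\alpha$, $\beta$, and the interaction term $C_{i,j}$ are assumptions in the style of standard uni-modal scaling laws, not the paper's exact fitted parameterization), the loss of a model trained on a mix of modalities $i$ and $j$ could be written as

$$\mathcal{L}_{i,j}(N, D) \;\approx\; \underbrace{E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}}_{\text{uni-modal scaling law}} \;+\; \underbrace{C_{i,j}(N, D)}_{\text{interaction term}},$$

where $N$ is model size and $D$ is the number of training tokens. Under this reading, $C_{i,j} < 0$ would indicate synergy between the two modalities and $C_{i,j} > 0$ competition, with both effects depending on data and model size as the abstract states.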
Author Information
Armen Aghajanyan (Facebook)
Lili Yu (Meta)
Alexis Conneau (OpenAI)
Wei-Ning Hsu (Facebook)
Karen Hambardzumyan (YerevaNN, Yerevan State University)
Susan Zhang
Stephen Roller (Facebook)
Naman Goyal (Facebook)
Omer Levy (Tel Aviv University / Facebook AI Research)
Luke Zettlemoyer (University of Washington)
More from the Same Authors
- 2023: Retrieval-Augmented Multimodal Language Modeling
  Michihiro Yasunaga · Armen Aghajanyan · Weijia Shi · Rich James · Jure Leskovec · Percy Liang · Mike Lewis · Luke Zettlemoyer · Wen-tau Yih
- 2023 Poster: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
  Alexei Baevski · Arun Babu · Wei-Ning Hsu · Michael Auli
- 2023 Oral: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
  Alexei Baevski · Arun Babu · Wei-Ning Hsu · Michael Auli
- 2023 Poster: Text-To-4D Dynamic Scene Generation
  Uriel Singer · Shelly Sheynin · Adam Polyak · Oron Ashual · Iurii Makarov · Filippos Kokkinos · Naman Goyal · Andrea Vedaldi · Devi Parikh · Justin Johnson · Yaniv Taigman
- 2023 Poster: The case for 4-bit precision: k-bit Inference Scaling Laws
  Tim Dettmers · Luke Zettlemoyer
- 2023 Poster: DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
  Yuhang Lai · Chengxi Li · Yiming Wang · Tianyi Zhang · Ruiqi Zhong · Luke Zettlemoyer · Scott Yih · Daniel Fried · Sida Wang · Tao Yu
- 2023 Poster: Retrieval-Augmented Multimodal Language Modeling
  Michihiro Yasunaga · Armen Aghajanyan · Weijia Shi · Richard James · Jure Leskovec · Percy Liang · Mike Lewis · Luke Zettlemoyer · Scott Yih
- 2021 Poster: BASE Layers: Simplifying Training of Large, Sparse Models
  Mike Lewis · Shruti Bhosale · Tim Dettmers · Naman Goyal · Luke Zettlemoyer
- 2021 Spotlight: BASE Layers: Simplifying Training of Large, Sparse Models
  Mike Lewis · Shruti Bhosale · Tim Dettmers · Naman Goyal · Luke Zettlemoyer
- 2021 Poster: Not All Memories are Created Equal: Learning to Forget by Expiring
  Sainbayar Sukhbaatar · Da JU · Spencer Poff · Stephen Roller · Arthur Szlam · Jason Weston · Angela Fan
- 2021 Oral: Not All Memories are Created Equal: Learning to Forget by Expiring
  Sainbayar Sukhbaatar · Da JU · Spencer Poff · Stephen Roller · Arthur Szlam · Jason Weston · Angela Fan
- 2020 Poster: Structural Language Models of Code
  Uri Alon · Roy Sadaka · Omer Levy · Eran Yahav