Continual Pre-Training of Large Language Models: How to re-warm your model?
Kshitij Gupta · Benjamin Thérien · Adam Ibrahim · Mats Richter · Quentin Anthony · Eugene Belilovsky · Timothée Lesort · Irina Rish
Event URL: https://openreview.net/forum?id=pg7PUJe0Tl
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process once new data becomes available. Since the sizes of available datasets and models have drastically increased, retraining models from scratch has become increasingly costly. A much cheaper and more efficient solution is to enable the continual pre-training of these models, i.e., updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards continual pre-training, we examine the effect of different warm-up strategies (e.g., varying the number of linear warm-up steps and the maximum learning rate) on upstream (Pile) and downstream (RedPajama) dataset performance. We conduct all experiments on the Pythia $410$M language model pre-trained on $300$B tokens from the Pile. Our results show that re-warming the learning rate leads to a decrease in performance under a limited compute budget. Consequently, when training stops at $50$B tokens, the best strategy is to avoid re-warming the learning rate altogether and keep it constant.
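The abstract contrasts two learning-rate strategies for continuing pre-training on new data: linearly re-warming the learning rate up to a maximum value versus simply keeping it constant. The sketch below illustrates the two schedules in PyTorch; it is not the authors' training code, and the values of max_lr and warmup_steps are illustrative assumptions.

```python
import torch

# Minimal sketch (not the authors' code) of the two schedules compared in the
# abstract: a linear re-warm-up to an assumed maximum learning rate vs. a
# constant learning rate while continuing pre-training on new data.

model = torch.nn.Linear(8, 8)   # stands in for the pre-trained LLM
max_lr = 3e-4                   # assumed maximum learning rate
warmup_steps = 1_000            # assumed number of linear warm-up steps

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)

def rewarm(step: int) -> float:
    """Linear warm-up from 0 to max_lr over warmup_steps, then hold."""
    return min(1.0, (step + 1) / warmup_steps)

def constant(step: int) -> float:
    """No re-warming: keep the learning rate constant throughout."""
    return 1.0

# Swap `rewarm` for `constant` to get the no-re-warming baseline the abstract
# favors at a 50B-token budget.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=rewarm)

for step in range(3 * warmup_steps):
    # ... forward/backward pass on a batch of the new data would go here ...
    optimizer.step()
    scheduler.step()
    if step % warmup_steps == 0:
        print(f"step {step:>5d}  lr = {scheduler.get_last_lr()[0]:.2e}")
```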
Author Information
Kshitij Gupta (Mila)
Benjamin Thérien (University of Waterloo)
Adam Ibrahim (Mila, University of Montreal)
Mats Richter
Quentin Anthony (Ohio State University, Columbus)
Eugene Belilovsky (Mila)
Timothée Lesort (UdeM, Mila)
Irina Rish (Mila / Université de Montréal)
More from the Same Authors
- 2022: Towards Out-of-Distribution Adversarial Robustness
  Adam Ibrahim · Charles Guille-Escuret · Ioannis Mitliagkas · Irina Rish · David Krueger · Pouya Bashivan
- 2023: Preventing Dimensional Collapse in Contrastive Local Learning with Subsampling
  Louis Fournier · Adeetya Patel · Michael Eickenberg · Edouard Oyallon · Eugene Belilovsky
- 2023: Towards Out-of-Distribution Adversarial Robustness
  Adam Ibrahim · Charles Guille-Escuret · Ioannis Mitliagkas · Irina Rish · David Krueger · Pouya Bashivan
- 2023: Learning to Optimize with Recurrent Hierarchical Transformers
  Abhinav Moudgil · Boris Knyazev · Guillaume Lajoie · Eugene Belilovsky
- 2023: Maximum State Entropy Exploration using Predecessor and Successor Representations
  Arnav Kumar Jain · Lucas Lehnert · Irina Rish · Glen Berseth
- 2023: Suboptimal Data Can Bottleneck Scaling
  Jacob Buckman · Kshitij Gupta · Ethan Caballero · Rishabh Agarwal · Marc Bellemare
- 2023: Re-Weighted Softmax Cross-Entropy to Control Forgetting in Federated Learning
  Gwen Legate · Lucas Caccia · Eugene Belilovsky
- 2023: Guiding The Last Layer in Federated Learning with Pre-Trained Models
  Gwen Legate · Nicolas Bernier · Lucas Caccia · Edouard Oyallon · Eugene Belilovsky
- 2023: Cognitive Models as Simulators: Using Cognitive Models to Tap into Implicit Human Feedback
  Ardavan S. Nobandegani · Thomas Shultz · Irina Rish
- 2023 Workshop: Localized Learning: Decentralized Model Updates via Non-Global Objectives
  David I. Inouye · Mengye Ren · Mateusz Malinowski · Michael Eickenberg · Gao Huang · Eugene Belilovsky
- 2023 Poster: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
  Stella Biderman · Hailey Schoelkopf · Quentin Anthony · Herbie Bradley · Kyle O'Brien · Eric Hallahan · Mohammad Aflah Khan · Shivanshu Purohit · USVSN Sai Prashanth · Edward Raff · Aviya Skowron · Lintang Sutawika · Oskar van der Wal
- 2023 Oral: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
  Stella Biderman · Hailey Schoelkopf · Quentin Anthony · Herbie Bradley · Kyle O'Brien · Eric Hallahan · Mohammad Aflah Khan · Shivanshu Purohit · USVSN Sai Prashanth · Edward Raff · Aviya Skowron · Lintang Sutawika · Oskar van der Wal
- 2023 Poster: Prototype-Sample Relation Distillation: Towards Replay-Free Continual Learning
  Nader Asadi · MohammadReza Davari · Sudhir Mudur · Rahaf Aljundi · Eugene Belilovsky
- 2023 Poster: Can Forward Gradient Match Backpropagation?
  Louis Fournier · Stéphane Rivaud (Sorbonne Université, ISIR) · Eugene Belilovsky · Michael Eickenberg · Edouard Oyallon
- 2022 Poster: Towards Scaling Difference Target Propagation by Learning Backprop Targets
  Maxence Ernoult · Fabrice Normandin · Abhinav Moudgil · Sean Spinney · Eugene Belilovsky · Irina Rish · Blake Richards · Yoshua Bengio
- 2022 Spotlight: Towards Scaling Difference Target Propagation by Learning Backprop Targets
  Maxence Ernoult · Fabrice Normandin · Abhinav Moudgil · Sean Spinney · Eugene Belilovsky · Irina Rish · Blake Richards · Yoshua Bengio
- 2021: Panel Discussion 1
  Razvan Pascanu · Irina Rish
- 2020: Panel Discussion
  Eric Eaton · Martha White · Doina Precup · Irina Rish · Harm van Seijen
- 2020: Q&A with Irina Rish
  Irina Rish · Shagun Sodhani · Sarath Chandar
- 2020: Invited Talk: Lifelong Learning: Towards Broad and Robust AI by Irina Rish
  Irina Rish
- 2020 Workshop: Workshop on Continual Learning
  Haytham Fayek · Arslan Chaudhry · David Lopez-Paz · Eugene Belilovsky · Jonathan Richard Schwarz · Marc Pickett · Rahaf Aljundi · Sayna Ebrahimi · Razvan Pascanu · Puneet Dokania
- 2019 Poster: Greedy Layerwise Learning Can Scale To ImageNet
  Eugene Belilovsky · Michael Eickenberg · Edouard Oyallon
- 2019 Oral: Greedy Layerwise Learning Can Scale To ImageNet
  Eugene Belilovsky · Michael Eickenberg · Edouard Oyallon