
Workshop: ES-FoMo: Efficient Systems for Foundation Models

Continual Pre-Training of Large Language Models: How to re-warm your model?

Kshitij Gupta · Benjamin Thérien · Adam Ibrahim · Mats Richter · Quentin Anthony · Eugene Belilovsky · Timothée Lesort · Irina Rish

Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process once new data becomes available. Since the sizes of available datasets and models have drastically increased, retraining models from scratch has become increasingly costly. A much cheaper and more efficient solution is to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them. However, the distribution shift induced by novel data typically results in degraded performance on past data. As a step towards continual pre-training, we examine the effect of different warm-up strategies (e.g. varying the number of linear warm-up steps and the maximum learning rate) on performance on the upstream (Pile) and downstream (RedPajama) datasets. We conduct all experiments on the Pythia $410$M language model pre-trained on $300$B tokens from the Pile. Our results show that, under a limited compute budget, re-warming the learning rate leads to a decrease in performance. Consequently, when training is stopped at $50$B tokens, the best strategy is to avoid re-warming the learning rate altogether and keep it constant.
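
To make the two strategies compared in the abstract concrete, here is a minimal sketch of a re-warmed schedule next to a constant one. It is an illustration only, not the authors' training code: the function names, the zero starting rate, and the numeric values are assumptions, and the sketch holds the rate at its maximum after warm-up because the abstract does not describe what follows the warm-up phase.

```python
def rewarmed_lr(step, warmup_steps, max_lr, start_lr=0.0):
    """Linear warm-up from start_lr to max_lr over warmup_steps steps.

    After warm-up the rate is held at max_lr (an assumption of this
    sketch; the abstract does not specify the later schedule).
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return max_lr
    frac = step / warmup_steps
    return start_lr + frac * (max_lr - start_lr)


def constant_lr(step, lr):
    """No re-warming: keep the learning rate fixed at lr for every step."""
    return lr


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the shape of each schedule.
    for step in (0, 50, 100, 200):
        print(step, rewarmed_lr(step, warmup_steps=100, max_lr=3e-4),
              constant_lr(step, lr=3e-5))
```

The two knobs varied in the abstract correspond to `warmup_steps` and `max_lr` here; the constant baseline simply skips the ramp.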
