Poster
A Hitchhiker's Guide to Scaling Law Estimation
Leshem Choshen · Yang Zhang · Jacob Andreas
East Exhibition Hall A-B #E-2506
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that, all else equal, estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ in scaling behavior, they are often similar enough that a target model's behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.
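As a concrete illustration of the kind of fit the paper studies, the sketch below estimates a Chinchilla-style power law, L(N, D) = E + A·N^(-alpha) + B·D^(-beta), from a handful of (parameter count, token count, loss) observations taken from smaller models, then extrapolates to a larger target model. The functional form, the SciPy-based fitting routine, and all numbers here are illustrative assumptions, not the paper's exact methodology or data.

```python
# Minimal sketch of scaling-law estimation, assuming a Chinchilla-style
# parameterization L(N, D) = E + A * N^-alpha + B * D^-beta.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, alpha, B, beta):
    """Predicted loss for a model with N parameters trained on D tokens."""
    N, D = x
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Hypothetical observations from smaller, easier-to-train models:
# (parameters, training tokens, measured loss). Intermediate checkpoints of a
# single run can be added as extra (N, D, loss) points, which the paper finds
# improves prediction accuracy.
N = np.array([1e8, 1e8, 3e8, 3e8, 1e9, 1e9])
D = np.array([5e9, 2e10, 5e9, 2e10, 1e10, 4e10])
loss = np.array([4.08, 3.92, 3.64, 3.47, 3.20, 3.06])

popt, _ = curve_fit(
    scaling_law, (N, D), loss,
    p0=[2.0, 300.0, 0.3, 300.0, 0.3],  # rough initial guesses
    maxfev=20000,
)
E, A, alpha, B, beta = popt

# Extrapolate to a larger target model (e.g., 7B parameters, 140B tokens).
target_loss = scaling_law((7e9, 1.4e11), *popt)
print(f"Fitted: E={E:.2f}, A={A:.1f}, alpha={alpha:.2f}, B={B:.1f}, beta={beta:.2f}")
print(f"Predicted loss for the target model: {target_loss:.2f}")
```

In practice the paper's best-practice recommendations concern which observations to include in such a fit (intermediate checkpoints, models of sizes close to the target, multiple seeds) rather than the specific optimizer used to minimize the fitting error.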
Training large language models is expensive and time-consuming, so researchers often try to predict how a model will perform before fully training it. One common technique is using "scaling laws," which are mathematical formulas that estimate how a model's performance changes as you increase the size of the model or the training data. These predictions help guide decisions like which architecture to use or how much data to collect.

In this paper, we take a closer look at how to make these predictions more accurate and useful. We gather a large dataset of nearly 500 language models and analyze how well different scaling law methods work. We discover that using not just the final results of model training but also data from earlier training steps leads to better predictions. We also find that comparing models of similar sizes gives more reliable results, and that training a few small models can sometimes give you more insight than training one big one. Even though different types of models scale differently, we show that it's often possible to predict how a new model will behave based on similar models and a few smart estimates. Our work offers practical tips for researchers trying to design and train better models more efficiently.