Predicting Large Model Test Losses with a Noisy Quadratic System
Chuning Li ⋅ Chris Maddison
Abstract
We introduce a predictive model that estimates the pre-training loss of large models from model size ($N$), batch size ($B$), and number of weight updates ($K$). This is the first loss prediction model that can handle a changing batch size. When projecting the loss to extrapolated compute budgets (up to 1000-fold), the model outperforms Chinchilla's loss model, which predicts the test loss from the model size and the number of training tokens. A natural use of the model is to find optimal $N, B, K$ configurations under explicit and compound resource constraints such as time, memory, and compute. In our experiments, the model-selected configurations are close to the ground-truth optimum. Our work advocates for loss prediction as a better alternative to heuristic-based laws, which are growing in complexity.
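As a rough illustration of the configuration search described above, the following minimal sketch enumerates candidate $(N, B, K)$ triples and keeps the one with the lowest predicted loss under a compute budget and a step budget. Everything in it is an assumption made for illustration: `toy_loss` is a hypothetical stand-in with made-up coefficients and is not the model introduced in this work, `best_config` is a hypothetical helper, and the FLOP estimate uses the common rule of thumb of 6 FLOPs per parameter per token.

```python
from itertools import product


def toy_loss(n_params, batch_size, num_steps):
    """Hypothetical loss predictor (assumption); NOT the paper's model."""
    # Assumed form: irreducible term + model-size term
    # + optimization term + noise term that shrinks with batch size.
    return (1.7 + 400.0 / n_params ** 0.34
            + 50.0 / num_steps ** 0.5 + 2.0 / batch_size)


def best_config(loss_fn, sizes, batches, steps, seq_len, max_flops, max_steps):
    """Grid search for the lowest predicted loss under two constraints.

    Returns (loss, N, B, K), or None if no candidate is feasible.
    """
    best = None
    for n, b, k in product(sizes, batches, steps):
        # Rule-of-thumb training cost: 6 FLOPs per parameter per token,
        # with tokens = batch_size * num_steps * seq_len (assumption).
        flops = 6 * n * b * k * seq_len
        if flops > max_flops or k > max_steps:
            continue  # violates the compute or the time (step) budget
        loss = loss_fn(n, b, k)
        if best is None or loss < best[0]:
            best = (loss, n, b, k)
    return best


if __name__ == "__main__":
    result = best_config(
        toy_loss,
        sizes=[1e7, 1e8, 1e9],
        batches=[32, 256],
        steps=[1_000, 10_000],
        seq_len=1024,
        max_flops=1e18,
        max_steps=10_000,
    )
    print(result)
```

Any predictor with the same `(N, B, K)` signature could be swapped in for `toy_loss`, and further constraints (for example a memory limit on $N$ and $B$) could be added as extra conditions in the feasibility check.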