How Far Can Quadratics Take Us? Lessons for LLM Pretraining
Abstract
Modern large language model pretraining is governed by complex heuristics, from cosine learning-rate decay to batch-size scheduling. Yet a growing body of work suggests that an analytically simple quadratic model can accurately predict much of this large-scale optimization behavior. In this talk, I will argue that the quadratic model is not merely a convenient theoretical toy, but a useful lens for pretraining practice, both for compute efficiency and for serial runtime.
We will begin with exact computations of critical batch size and time-dependent learning rates in linear systems, establishing a principled foundation. From there, we will see how the same analysis yields batch-size scaling laws (where we estimate batch-size exponents in LLMs) and motivates two pretraining improvements: SeeSaw, a scheduler that trades learning-rate decay for batch-size growth and matches loss at lower serial runtime; and Horizon-Free Pretraining, which shows how anytime schedules with weight averaging can match carefully tuned cosine decay without committing to a horizon in advance. We will close with lower bounds on the interaction between momentum and batch size, which suggest that the quadratic model captures fundamental limits on what any first-order method can achieve.
Taken together, these results make the case that quadratics deserve a more central place in how we think about pretraining.
Speaker