Invited Talk Tue, Jul 7, 2026 • 4:30 PM – 5:30 PM PDT HALL C

How Far Can Quadratics Take Us? Lessons for LLM Pretraining

Sham Kakade

Abstract

Modern large language model pretraining is governed by complex heuristics — from cosine learning-rate decay to batch-size scheduling. Yet, a growing body of work suggests that an analytically simple quadratic model can accurately predict much of this large-scale optimization behavior. In this talk, I will argue that the quadratic model is not merely a convenient theoretical toy, but a useful lens for pretraining practice — both for compute efficiency and for serial runtime.

We will begin with exact computations of critical batch size and time-dependent learning rates in linear systems, establishing a principled foundation. From there, we will see how the same analysis yields batch-size scaling laws (where we estimate batch-size exponents in LLMs) and motivates two pretraining improvements: SeeSaw, a scheduler that trades learning-rate decay for batch-size growth and matches loss at lower serial runtime; and Horizon-Free Pretraining, which shows how anytime schedules with weight averaging can match carefully tuned cosine decay without committing to a horizon in advance. We will close with lower bounds on the interaction between momentum and batch size, which suggest the quadratic model captures fundamental limits on what any first-order method can achieve.

Taken together, these results make the case that quadratics deserve a more central place in how we think about pretraining.
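
As a rough illustration of why a quadratic can speak to the trade-off between learning-rate decay and batch-size growth mentioned in the abstract, the sketch below runs the exact second-moment recursion for SGD on a one-dimensional noisy quadratic. It is not the SeeSaw rule or any analysis from the talk. The model (loss 0.5 * h * w^2 with additive gradient noise of variance sigma2 / batch), the function name `expected_final_loss`, and all constants are assumptions chosen for illustration.

```python
def expected_final_loss(schedule, h=1.0, sigma2=1.0, w0_sq=1.0):
    """Expected final loss of SGD on L(w) = 0.5 * h * w**2.

    Assumed toy model: each stochastic gradient is h * w + noise, where the
    noise has zero mean, variance sigma2 / batch, and is independent of w.
    `schedule` is a list of (learning_rate, batch_size) pairs, one per step.
    Returns (expected loss, number of serial steps, total samples consumed).
    """
    second_moment = w0_sq  # E[w^2], starting from a fixed initial point
    samples = 0
    for lr, batch in schedule:
        # Exact recursion: E[w'^2] = (1 - lr*h)^2 E[w^2] + lr^2 * sigma2 / batch
        second_moment = (1.0 - lr * h) ** 2 * second_moment + lr ** 2 * sigma2 / batch
        samples += batch
    return 0.5 * h * second_moment, len(schedule), samples


# Schedule A: halve the learning rate midway, keep the batch size fixed.
lr_decay = [(0.1, 8)] * 200 + [(0.05, 8)] * 200
# Schedule B: keep the learning rate, double the batch size instead,
# and run half as many steps in the second phase (same sample budget).
batch_growth = [(0.1, 8)] * 200 + [(0.1, 16)] * 100

loss_a, steps_a, samples_a = expected_final_loss(lr_decay)
loss_b, steps_b, samples_b = expected_final_loss(batch_growth)

print(f"lr decay:     loss={loss_a:.6f} steps={steps_a} samples={samples_a}")
print(f"batch growth: loss={loss_b:.6f} steps={steps_b} samples={samples_b}")
```

In this toy setting the two schedules consume the same number of samples and end at similar expected loss, while the batch-growth schedule takes fewer serial steps. The real schedulers discussed in the talk operate on LLM pretraining and may follow different rules.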

Speaker

Sham Kakade

I work on advancing the fundamental capabilities needed to develop artificial general intelligence and create systems that can effectively interact with and add value to the real world. My current interests include: (i) developing full-stack training pipelines for foundation models, with particular focus on distributed systems architecture, scalable optimization algorithms, and principled approaches to data curation and composition; (ii) investigating the mathematical and scientific principles that govern large-scale learning systems, with emphasis on understanding emergent capabilities, scaling laws, and fundamental limits of neural architectures; and (iii) advancing autonomous agent architectures that can reason, plan, and learn from interaction, with focus on bridging the gap between language models and embodied intelligence in complex environments.