Why Do We Need Warm-up? A Theoretical Perspective
Foivos Alimisis ⋅ Rustem Islamov ⋅ Aurelien Lucchi
Abstract
Learning rate warm-up -- increasing the learning rate at the beginning of training -- has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We show -- both theoretically and empirically -- that this condition is satisfied by common neural architectures and accurately captures the curvature of the optimization landscape early in training. Adapting the learning rate in response to this curvature condition naturally induces a warm-up–like schedule, and we show that this choice yields provably faster convergence guarantees than using a fixed learning rate. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the agreement between our theoretically derived schedule and standard warm-up.
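To make the mechanism concrete, here is a minimal sketch of how a curvature bound that grows linearly with the loss sub-optimality induces a warm-up-like schedule. The exact form of the condition, $\|\nabla^2 f(x)\| \le L_0 + L_1 (f(x) - f^*)$, the inverse step-size rule, and the constants `L0`, `L1`, `f_star` are illustrative assumptions consistent with the abstract's description, not the paper's precise statement.

```python
# Hypothetical constants for a generalized (L0, L1)-smoothness condition,
# ||Hessian(x)|| <= L0 + L1 * (f(x) - f*). Values are illustrative only.
L0 = 1.0      # curvature floor near the optimum (assumed)
L1 = 0.5      # sensitivity of curvature to loss sub-optimality (assumed)
f_star = 0.0  # optimal loss value (assumed known for illustration)

def curvature_adapted_lr(loss: float) -> float:
    """Step size inversely proportional to the local curvature bound.

    Early in training the loss, and hence the curvature bound, is large,
    so the step size starts small and grows as the loss decreases:
    a warm-up-like schedule emerges without being hand-designed.
    """
    return 1.0 / (L0 + L1 * (loss - f_star))

# Toy run on f(x) = x^2 / 2 to show the induced schedule.
x = 10.0
for step in range(10):
    loss = 0.5 * x * x
    lr = curvature_adapted_lr(loss)
    x -= lr * x  # gradient of f is x
    print(f"step {step}: loss={loss:.3f}, lr={lr:.4f}")
```

Running this toy example, the printed learning rate starts near 0.04 when the loss is large and rises toward 1.0 as the iterate approaches the optimum, mirroring the shape of a standard warm-up schedule.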