Poster in the High-dimensional Learning Dynamics Workshop: The Emergence of Structure and Reasoning
Analyzing & Eliminating Learning Rate Warmup in GPT Pre-Training
Atli Kosson · Bettina Messmer · Martin Jaggi
Learning rate warmup is a popular heuristic for training neural networks that downscales early updates relative to later ones. Its effectiveness suggests that the initial updates are too large in some sense, but by which criterion, and why, remains unclear. In this work we explore this question for small-scale GPT pre-training by measuring and controlling the update size under various metrics. We find the standard L2 norm of the updates to be an insufficient measure, whereas controlling the relative change of either the weight matrices or the neural representations is promising for reducing or eliminating the need for explicit warmup. Quantifying updates in representation space in particular helps withstand changes in the gradient signal-to-noise ratio, or "critical batch size", over the course of training, which warmup can help counteract but simpler weight-based methods fail to account for.
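As a concrete illustration of what "controlling the relative change of the weight matrices" can mean in practice, the sketch below caps the per-matrix relative update ||ΔW||_F / ||W||_F after each optimizer step. This is a minimal sketch under our own assumptions (the cap value, the toy model, and the post-step rescaling rule are illustrative), not the paper's exact procedure.

```python
# Minimal sketch: cap the relative weight change ||ΔW||_F / ||W||_F of each
# matrix parameter at a fixed target, as a possible substitute for an explicit
# learning-rate warmup schedule. Target value and model are illustrative.
import torch
import torch.nn as nn

def capped_relative_update_step(model, optimizer, target=1e-3, eps=1e-12):
    """Take an optimizer step, then shrink each matrix update so that
    ||W_new - W_old||_F / ||W_old||_F does not exceed `target`."""
    old = {p: p.detach().clone() for p in model.parameters() if p.dim() >= 2}
    optimizer.step()
    with torch.no_grad():
        for p, w_old in old.items():
            delta = p - w_old
            rel = delta.norm() / (w_old.norm() + eps)
            if rel > target:
                # Rescale the update so the relative change equals the target.
                p.copy_(w_old + delta * (target / rel))

# Toy usage on random data (illustrative only).
model = nn.Linear(32, 32)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 32), torch.randn(8, 32)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
capped_relative_update_step(model, opt, target=1e-3)
```

A representation-space variant would instead measure the change in layer outputs on a held-out batch before and after the step; the sketch above only covers the simpler weight-based criterion.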