

Poster in Workshop: High-dimensional Learning Dynamics Workshop: The Emergence of Structure and Reasoning

Analyzing & Eliminating Learning Rate Warmup in GPT Pre-Training

Atli Kosson · Bettina Messmer · Martin Jaggi


Abstract:

Learning Rate Warmup is a popular heuristic for training neural networks that downscales early updates relative to later ones. The fact that it aids training suggests the initial updates are too large in some sense, but why, and by which criterion, remains unclear. In this work we explore this question for small-scale GPT pre-training by measuring and controlling the update size via various metrics. We find the standard L2 norm of the updates to be insufficient, whereas constraining the relative change of either the matrix weights or the neural representations is promising for reducing or eliminating the need for explicit warmup. Quantifying the updates in representation space in particular can help withstand changes in the gradient signal-to-noise ratio, or "critical batch size", throughout training, which warmup can help counteract but simpler weight-based methods fail to account for.
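
As an illustration of the weight-based variant mentioned in the abstract, the sketch below caps the per-step relative weight change (the norm of the update divided by the norm of the weights) after an optimizer step. This is a minimal PyTorch sketch of the general idea under stated assumptions only: the function name `relative_update_step`, the cap `max_rel_change`, and the simple per-tensor rescaling rule are illustrative and not the authors' exact method.

```python
import torch

def relative_update_step(optimizer, model, max_rel_change=0.01):
    """Take an optimizer step, then rescale each parameter's update so that
    ||delta W|| / ||W|| does not exceed max_rel_change.

    Illustrative sketch only; the cap value and rescaling rule are assumptions.
    """
    # Snapshot parameters before the update.
    prev = {p: p.detach().clone()
            for p in model.parameters() if p.requires_grad}
    optimizer.step()
    with torch.no_grad():
        for p, w_old in prev.items():
            delta = p - w_old
            denom = w_old.norm().clamp_min(1e-12)
            rel = delta.norm() / denom
            if rel > max_rel_change:
                # Shrink the update so the relative change equals the cap.
                p.copy_(w_old + delta * (max_rel_change / rel))

# Usage (illustrative):
# model = torch.nn.Linear(16, 16)
# opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
# loss = model(torch.randn(8, 16)).pow(2).mean()
# loss.backward()
# relative_update_step(opt, model, max_rel_change=0.01)
```

A representation-space analogue would instead bound the change in layer outputs on a reference batch, which is what the abstract suggests is more robust to shifts in the gradient signal-to-noise ratio; that variant is not sketched here.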
