Poster in Workshop: Next Generation of Sequence Modeling Architectures

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Alexander Hägele · Elie Bakouch · Atli Kosson · Loubna Ben allal · Leandro Von Werra · Martin Jaggi


Abstract:

Scale has become a crucial factor for obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to designing new neural architectures and training schemes effectively. In this work, we argue that scale and training research has been needlessly complicated by the reliance on the cosine learning rate schedule, which requires a separate run for each training duration of interest. We investigate a direct alternative -- constant learning rate and cooldowns -- that allows reusing compute between runs of different lengths. We analyze different recipes for the schedule and find performance equivalent or superior to cosine, while scaling as predictably and reliably as cosine. Additionally, we show that stochastic weight averaging yields strong performance improvements along the training trajectory, without additional training cost, across different scales. Importantly, with these findings, we demonstrate that scaling experiments can be performed with significantly fewer GPU hours and FLOPs.
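To make the schedule concrete, below is a minimal sketch of a constant learning rate followed by a cooldown to zero. The peak rate, cooldown fraction, and linear decay shape are illustrative assumptions rather than values prescribed by the paper; the key property is that checkpoints from the constant phase can be reused to launch cooldowns for several training durations.

```python
def constant_with_cooldown(step, total_steps, peak_lr=3e-4, cooldown_frac=0.2):
    """Learning rate at `step` (0-indexed) for a run of `total_steps`.

    Assumes a linear decay to zero during the final `cooldown_frac` of
    training; both the fraction and the decay shape are illustrative.
    """
    cooldown_steps = int(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return peak_lr  # constant phase: checkpoints here are reusable
    # cooldown phase: linearly anneal the rate to zero
    progress = (step - cooldown_start) / max(cooldown_steps, 1)
    return peak_lr * (1.0 - progress)


if __name__ == "__main__":
    total = 10_000
    for s in (0, 7_999, 8_000, 9_000, 9_999):
        print(s, round(constant_with_cooldown(s, total), 6))
```

Because the constant phase is schedule-agnostic, extending training to a longer duration only requires resuming from an existing checkpoint and running a fresh cooldown, rather than repeating the full run as a cosine schedule would demand.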
