Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Alexander Hägele · Elie Bakouch · Atli Kosson · Loubna Ben allal · Leandro Von Werra · Martin Jaggi
Scale has become a crucial factor for obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to designing new neural architectures and training schemes effectively. In this work, we argue that research on scaling and training has been needlessly complicated by reliance on the cosine learning rate schedule, which requires a separate run for each training duration of interest. We investigate a direct alternative -- a constant learning rate followed by a cooldown -- that allows compute to be reused across runs of different lengths. We analyze different cooldown recipes and find performance that matches or exceeds cosine, while scaling just as predictably and reliably. Additionally, we show that stochastic weight averaging yields strong performance improvements along the training trajectory, without additional training cost, across different scales. Importantly, these findings let scaling experiments be performed with significantly fewer GPU hours and FLOPs. Our code is available at https://github.com/epfml/schedules-and-scaling/.
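To make the schedule concrete, the sketch below shows one way to wire a constant learning rate with a final cooldown, plus periodic weight averaging, into a standard PyTorch training loop. It is a minimal illustration, not the paper's implementation: the helper `wsd_lr_factor`, the linear cooldown shape, the 20% cooldown fraction, and the averaging interval are all assumptions for exposition; the actual recipes are in the linked repository.

```python
import torch

def wsd_lr_factor(step, warmup_steps, total_steps, cooldown_frac=0.2):
    """Multiplicative LR factor for a constant-LR-plus-cooldown schedule.

    Hypothetical recipe: linear warmup, a long constant (stable) phase, then a
    linear cooldown to zero over the final `cooldown_frac` of training.
    """
    cooldown_steps = max(1, int(cooldown_frac * total_steps))
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        return step / max(1, warmup_steps)                    # linear warmup
    if step < cooldown_start:
        return 1.0                                            # constant phase
    return max(0.0, (total_steps - step) / cooldown_steps)    # linear cooldown

model = torch.nn.Linear(16, 16)                               # stand-in for a language model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda s: wsd_lr_factor(s, warmup_steps=100, total_steps=10_000),
)
swa_model = torch.optim.swa_utils.AveragedModel(model)        # running weight average

for step in range(10_000):
    # ... compute loss, loss.backward(), gradient clipping, etc. ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
    if step % 100 == 0:
        swa_model.update_parameters(model)                    # average weights along the trajectory
```

Because the learning rate stays constant until the cooldown begins, a single long run can be branched into cooldowns of different lengths, which is what allows compute to be shared across training durations instead of launching a fresh cosine run per duration.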