

Poster in Workshop: HiLD: High-dimensional Learning Dynamics Workshop

Flatter, Faster: Scaling Momentum for Optimal Speedup of SGD

Aditya Cowsik · Tankut Can · Paolo Glorioso


Abstract: Optimization algorithms typically show a trade-off between good generalization and faster training times. While stochastic gradient descent (SGD) offers good generalization, adaptive gradient methods train faster. Momentum can speed up SGD, but choosing the right hyperparameters can be challenging. We study training dynamics in overparametrized neural networks with SGD, label noise, and momentum. Our findings indicate that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ enhances training speed without compromising generalization. Our architecture-independent framework is built on the assumption of the existence of a manifold of global minimizers, typically present in overparametrized models. We highlight the emergence of two distinct timescales in training dynamics, with maximum acceleration achieved when these timescales coincide, which leads to our proposed scaling limit. Experiments on matrix sensing, an MLP on FashionMNIST, and a ResNet-18 on CIFAR10 confirm the validity of our proposed scaling.
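
Below is a minimal sketch, not the authors' code, of how the proposed scaling could be applied in practice: the momentum hyperparameter $\beta$ is tied to the learning rate via $1-\beta = c\,\eta^{2/3}$. The constant `c`, the toy model, and the assumption that PyTorch's heavy-ball `momentum` argument corresponds to $\beta$ are illustrative choices, not details taken from the paper.

```python
import torch

def momentum_from_lr(lr: float, c: float = 1.0) -> float:
    """Set beta via the proposed scaling 1 - beta = c * lr**(2/3),
    clamped to [0, 0.999] for numerical safety (clamp is a safeguard,
    not part of the paper's prescription)."""
    return max(0.0, min(1.0 - c * lr ** (2.0 / 3.0), 0.999))

model = torch.nn.Linear(10, 1)   # toy stand-in for an overparametrized model
lr = 1e-2
beta = momentum_from_lr(lr)      # beta ~ 0.954 for lr = 1e-2, c = 1
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=beta)
```

If the learning rate is decayed during training, the same rule can be re-applied at each change so that the two timescales discussed in the abstract remain matched.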
