

Spotlight in Workshop: Over-parameterization: Pitfalls and Opportunities

Towards understanding how momentum improves generalization in deep learning

Samy Jelassi · Yuanzhi Li


Abstract:

Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that momentum can lead to a faster convergence rate in various settings, it has also been empirically observed that adding momentum yields better generalization. This paper formally studies how momentum helps generalization in deep learning:
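
For reference, below is a minimal sketch of the heavy-ball SGD-with-momentum update the abstract refers to. The variable names (lr, beta, velocity) and the toy objective are illustrative assumptions, not taken from the paper.

import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, beta=0.9):
    """One heavy-ball momentum step: v <- beta * v + g;  w <- w - lr * v."""
    velocity = beta * velocity + grads
    params = params - lr * velocity
    return params, velocity

# Usage: minimize f(w) = ||w||^2 / 2, whose gradient at w is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_momentum_step(w, w, v)  # pass grad f(w) = w
print(w)  # approaches the minimizer at the origin

Setting beta = 0 recovers plain SGD; the paper's question is why the beta > 0 iterates tend to generalize better, beyond converging faster.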