Spotlight in Workshop: Over-parameterization: Pitfalls and Opportunities
Towards understanding how momentum improves generalization in deep learning
Samy Jelassi · Yuanzhi Li
Abstract:
Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that momentum can lead to faster convergence in various settings, it has also been empirically observed that adding momentum improves generalization. This paper formally studies how momentum helps generalization in deep learning.
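For reference, below is a minimal sketch of the standard heavy-ball momentum update that the abstract refers to, written in the PyTorch-style convention (v <- beta*v + g; theta <- theta - lr*v). This is illustrative only, not the paper's specific analysis; the learning rate, momentum coefficient, and the toy quadratic objective are placeholder choices.

```python
import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum (heavy-ball) step.

    Placeholder hyperparameters: lr and beta are illustrative defaults,
    not values from the paper.
    """
    velocity = beta * velocity + grad      # accumulate an exponentially weighted gradient
    theta = theta - lr * velocity          # move against the accumulated direction
    return theta, velocity

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)
for _ in range(50):
    grad = theta                           # gradient of the toy quadratic
    theta, velocity = sgd_momentum_step(theta, velocity, grad)
print(theta)                               # approaches the minimizer at the origin
```

Compared with plain SGD (beta = 0), the velocity term averages recent gradients, which damps oscillations and can accelerate progress along consistent descent directions; the paper's contribution concerns how this mechanism affects generalization, which the sketch above does not capture.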