Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that momentum can lead to faster convergence in various settings, it has also been observed that momentum yields better generalization. Prior work argues that momentum stabilizes the SGD noise during training, which in turn improves generalization. In this paper, we adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. From this observation, we formally study how momentum improves generalization. We devise a binary classification setting where a one-hidden-layer (over-parameterized) convolutional neural network trained with GD+M provably generalizes better than the same network trained with GD, when both algorithms are similarly initialized. The key insight in our analysis is that momentum is beneficial in datasets where the examples share some feature but differ in their margin. Contrary to GD, which memorizes the small-margin data, GD+M still learns the feature in these data thanks to its historical gradients. Lastly, we empirically validate our theoretical findings.
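The "historical gradients" mechanism in the abstract refers to the standard heavy-ball momentum update. A minimal sketch of the two update rules being compared (this is the textbook formulation, not the paper's exact experimental setup; the learning rate `lr` and momentum coefficient `beta` are illustrative values):

```python
def gd_step(w, grad, lr):
    """Plain gradient descent: w <- w - lr * grad."""
    return w - lr * grad

def gdm_step(w, v, grad, lr, beta=0.9):
    """Heavy-ball momentum: the buffer v accumulates an exponentially
    weighted sum of past gradients, so a feature signal seen in earlier
    gradients keeps influencing the update even on small-margin examples."""
    v = beta * v + grad
    return w - lr * v, v
```

After a few steps, the momentum buffer `v` carries contributions from all previous gradients weighted by powers of `beta`, which is the property the analysis exploits: the shared feature direction keeps being reinforced while per-example noise averages out.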
Author Information
Samy Jelassi (Princeton University)
Yuanzhi Li (CMU)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Towards understanding how momentum improves generalization in deep learning
  Wed. Jul 20 through Thu. Jul 21, Room Hall E #237
More from the Same Authors
- 2021: When Is Generalizable Reinforcement Learning Tractable?
  Dhruv Malik · Yuanzhi Li · Pradeep Ravikumar
- 2021: Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity
  Dhruv Malik · Aldo Pacchiano · Vishwak Srinivasan · Yuanzhi Li
- 2021: Towards understanding how momentum improves generalization in deep learning
  Samy Jelassi · Yuanzhi Li
- 2023: How Does Adaptive Optimization Impact Local Neural Network Geometry?
  Kaiqi Jiang · Dhruv Malik · Yuanzhi Li
- 2023: Plan, Eliminate, and Track --- Language Models are Good Teachers for Embodied Agents
  Yue Wu · So Yeon Min · Yonatan Bisk · Ruslan Salakhutdinov · Amos Azaria · Yuanzhi Li · Tom Mitchell · Shrimai Prabhumoye
- 2023: SPRING: Studying Papers and Reasoning to play Games
  Yue Wu · Shrimai Prabhumoye · So Yeon Min · Yonatan Bisk · Ruslan Salakhutdinov · Amos Azaria · Tom Mitchell · Yuanzhi Li
- 2023: How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding
  Yuchen Li · Yuanzhi Li · Andrej Risteski
- 2023 Poster: Weighted Tallying Bandits: Overcoming Intractability via Repeated Exposure Optimality
  Dhruv Malik · Conor Igoe · Yuanzhi Li · Aarti Singh
- 2023 Poster: How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding
  Yuchen Li · Yuanzhi Li · Andrej Risteski
- 2023 Poster: The Benefits of Mixup for Feature Learning
  Difan Zou · Yuan Cao · Yuanzhi Li · Quanquan Gu
- 2021 Poster: Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity
  Dhruv Malik · Aldo Pacchiano · Vishwak Srinivasan · Yuanzhi Li
- 2021 Spotlight: Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity
  Dhruv Malik · Aldo Pacchiano · Vishwak Srinivasan · Yuanzhi Li
- 2021 Poster: Toward Understanding the Feature Learning Process of Self-supervised Contrastive Learning
  Zixin Wen · Yuanzhi Li
- 2021 Spotlight: Toward Understanding the Feature Learning Process of Self-supervised Contrastive Learning
  Zixin Wen · Yuanzhi Li
- 2020 Poster: Extra-gradient with player sampling for faster convergence in n-player games
  Samy Jelassi · Carles Domingo-Enrich · Damien Scieur · Arthur Mensch · Joan Bruna
- 2019 Poster: Neuron birth-death dynamics accelerates gradient descent and converges asymptotically
  Grant Rotskoff · Samy Jelassi · Joan Bruna · Eric Vanden-Eijnden
- 2019 Oral: Neuron birth-death dynamics accelerates gradient descent and converges asymptotically
  Grant Rotskoff · Samy Jelassi · Joan Bruna · Eric Vanden-Eijnden