This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101 and WT103 + TransformerXL), the network's weights do not converge to stationary points where the gradient of the loss vanishes. Remarkably, even though the weights do not converge to stationary points, the training loss stops decreasing and stabilizes. Motivated by this observation, we propose a new perspective, based on the ergodic theory of dynamical systems, to explain the phenomenon: rather than studying the evolution of the weights, we study the evolution of the distribution of the weights. We prove that the distribution of the weights converges to an approximate invariant measure, which explains how the training loss can stabilize without the weights converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.
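To make the abstract's claim concrete, here is a minimal, self-contained sketch (an illustrative assumption, not the paper's experiments, which use ImageNet/ResNet101 and WT103/TransformerXL): it runs constant-step-size SGD on a hypothetical noisy regression task while logging the full-batch loss and gradient norm, and also samples one weight coordinate to contrast the moving iterate with its stabilizing statistics. All names and the synthetic task are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data; the label noise keeps the stochastic gradients
# from vanishing, mimicking the nonzero gradient "floor" seen at scale.
X = torch.randn(512, 20)
y = X @ torch.randn(20, 1) + 0.5 * torch.randn(512, 1)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

losses, grad_norms, w_samples = [], [], []
for step in range(5000):
    idx = torch.randint(0, 512, (32,))  # minibatch SGD, constant step size
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()

    if step % 100 == 0:
        # Full-batch loss and gradient norm at the current iterate.
        opt.zero_grad()
        full_loss = loss_fn(model(X), y)
        full_loss.backward()
        gnorm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        losses.append(full_loss.item())
        grad_norms.append(gnorm.item())
        # One weight coordinate: its value keeps moving after the loss
        # flattens, while its empirical distribution stabilizes.
        w_samples.append(model[0].weight[0, 0].item())

# If the paper's observation carries over to this toy setting, the tail of
# `losses` is flat while `grad_norms` stays bounded away from zero.
print("last full-batch losses:", [round(v, 3) for v in losses[-5:]])
print("last gradient norms:  ", [round(v, 3) for v in grad_norms[-5:]])

# Ergodic-average view: the trailing time-average of the loss stabilizes
# even though the sampled weight coordinate still fluctuates.
tail = losses[-10:]
print("trailing mean loss:   ", round(sum(tail) / len(tail), 3))
print("weight samples (tail):", [round(v, 3) for v in w_samples[-5:]])
```

The sketch only illustrates the distinction the paper draws: convergence of time-averaged (distributional) quantities versus non-convergence of the iterates themselves; the paper's formal statement concerns convergence of the weight distribution to an approximate invariant measure.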
Author Information
Jingzhao Zhang (Tsinghua University)
Haochuan Li (MIT)
Suvrit Sra (MIT & Macro-Eyes)
Ali Jadbabaie (MIT)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective
  Tue, Jul 19 through Wed, Jul 20 · Hall E #613
More from the Same Authors
- 2023 Poster: Global optimality for Euclidean CCCP under Riemannian convexity
  Melanie Weber · Suvrit Sra
- 2023 Poster: On the Training Instability of Shuffling SGD with Batch Normalization
  David X. Wu · Chulhee Yun · Suvrit Sra
- 2022: Sign and Basis Invariant Networks for Spectral Graph Representation Learning
  Derek Lim · Joshua Robinson · Lingxiao Zhao · Tess Smidt · Suvrit Sra · Haggai Maron · Stefanie Jegelka
- 2022 Poster: Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity
  Jingzhao Zhang · Hongzhou Lin · Subhro Das · Suvrit Sra · Ali Jadbabaie
- 2022 Spotlight: Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity
  Jingzhao Zhang · Hongzhou Lin · Subhro Das · Suvrit Sra · Ali Jadbabaie
- 2022 Poster: Understanding the unstable convergence of gradient descent
  Kwangjun Ahn · Jingzhao Zhang · Suvrit Sra
- 2022 Poster: On Convergence of Gradient Descent Ascent: A Tight Local Analysis
  Haochuan Li · Farzan Farnia · Subhro Das · Ali Jadbabaie
- 2022 Spotlight: On Convergence of Gradient Descent Ascent: A Tight Local Analysis
  Haochuan Li · Farzan Farnia · Subhro Das · Ali Jadbabaie
- 2022 Spotlight: Understanding the unstable convergence of gradient descent
  Kwangjun Ahn · Jingzhao Zhang · Suvrit Sra
- 2021 Poster: Provably Efficient Algorithms for Multi-Objective Competitive RL
  Tiancheng Yu · Yi Tian · Jingzhao Zhang · Suvrit Sra
- 2021 Poster: Online Learning in Unknown Markov Games
  Yi Tian · Yuanhao Wang · Tiancheng Yu · Suvrit Sra
- 2021 Spotlight: Online Learning in Unknown Markov Games
  Yi Tian · Yuanhao Wang · Tiancheng Yu · Suvrit Sra
- 2021 Oral: Provably Efficient Algorithms for Multi-Objective Competitive RL
  Tiancheng Yu · Yi Tian · Jingzhao Zhang · Suvrit Sra
- 2021 Poster: Three Operator Splitting with a Nonconvex Loss Function
  Alp Yurtsever · Varun Mangalick · Suvrit Sra
- 2021 Spotlight: Three Operator Splitting with a Nonconvex Loss Function
  Alp Yurtsever · Varun Mangalick · Suvrit Sra
- 2020 Poster: Strength from Weakness: Fast Learning Using Weak Supervision
  Joshua Robinson · Stefanie Jegelka · Suvrit Sra
- 2020 Poster: Complexity of Finding Stationary Points of Nonconvex Nonsmooth Functions
  Jingzhao Zhang · Hongzhou Lin · Stefanie Jegelka · Suvrit Sra · Ali Jadbabaie
- 2020 Poster: Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition
  Chi Jin · Tiancheng Jin · Haipeng Luo · Suvrit Sra · Tiancheng Yu
- 2019 Poster: Escaping Saddle Points with Adaptive Gradient Methods
  Matthew Staib · Sashank Jakkam Reddi · Satyen Kale · Sanjiv Kumar · Suvrit Sra
- 2019 Oral: Escaping Saddle Points with Adaptive Gradient Methods
  Matthew Staib · Sashank Jakkam Reddi · Satyen Kale · Sanjiv Kumar · Suvrit Sra
- 2019 Poster: Random Shuffling Beats SGD after Finite Epochs
  Jeff HaoChen · Suvrit Sra
- 2019 Oral: Random Shuffling Beats SGD after Finite Epochs
  Jeff HaoChen · Suvrit Sra
- 2019 Poster: Gradient Descent Finds Global Minima of Deep Neural Networks
  Simon Du · Jason Lee · Haochuan Li · Liwei Wang · Xiyu Zhai
- 2019 Poster: Conditional Gradient Methods via Stochastic Path-Integrated Differential Estimator
  Alp Yurtsever · Suvrit Sra · Volkan Cevher
- 2019 Oral: Conditional Gradient Methods via Stochastic Path-Integrated Differential Estimator
  Alp Yurtsever · Suvrit Sra · Volkan Cevher
- 2019 Oral: Gradient Descent Finds Global Minima of Deep Neural Networks
  Simon Du · Jason Lee · Haochuan Li · Liwei Wang · Xiyu Zhai