Timezone: »
Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.
Author Information
Matthew Staib (MIT)
Sashank Jakkam Reddi (Google)
Satyen Kale (Google)
Sanjiv Kumar (Google Research, NY)
Suvrit Sra (MIT)
Related Events (a corresponding poster, oral, or spotlight)
-
2019 Oral: Escaping Saddle Points with Adaptive Gradient Methods »
Thu. Jun 13th 05:05 -- 05:10 PM Room Room 104
More from the Same Authors
-
2023 : How to escape sharp minima »
Kwangjun Ahn · Ali Jadbabaie · Suvrit Sra -
2023 : Toward Understanding Latent Model Learning in MuZero: A Case Study in Linear Quadratic Gaussian Control »
Yi Tian · Kaiqing Zhang · Russ Tedrake · Suvrit Sra -
2023 : SpecTr: Fast Speculative Decoding via Optimal Transport »
Ziteng Sun · Ananda Suresh · Jae Ro · Ahmad Beirami · Himanshu Jain · Felix Xinnan Yu · Michael Riley · Sanjiv Kumar -
2023 Poster: On the Training Instability of Shuffling SGD with Batch Normalization »
David X. Wu · Chulhee Yun · Suvrit Sra -
2023 Poster: Global optimality for Euclidean CCCP under Riemannian convexity »
Melanie Weber · Suvrit Sra -
2023 Poster: Efficient Training of Language Models using Few-Shot Learning »
Sashank Jakkam Reddi · Sobhan Miryoosefi · Stefani Karp · Shankar Krishnan · Satyen Kale · Seungyeon Kim · Sanjiv Kumar -
2022 : Sign and Basis Invariant Networks for Spectral Graph Representation Learning »
Derek Lim · Joshua Robinson · Lingxiao Zhao · Tess Smidt · Suvrit Sra · Haggai Maron · Stefanie Jegelka -
2022 Poster: In defense of dual-encoders for neural ranking »
Aditya Menon · Sadeep Jayasumana · Ankit Singh Rawat · Seungyeon Kim · Sashank Jakkam Reddi · Sanjiv Kumar -
2022 Spotlight: In defense of dual-encoders for neural ranking »
Aditya Menon · Sadeep Jayasumana · Ankit Singh Rawat · Seungyeon Kim · Sashank Jakkam Reddi · Sanjiv Kumar -
2022 Poster: Agnostic Learnability of Halfspaces via Logistic Loss »
Ziwei Ji · Kwangjun Ahn · Pranjal Awasthi · Satyen Kale · Stefani Karp -
2022 Poster: Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity »
Jingzhao Zhang · Hongzhou Lin · Subhro Das · Suvrit Sra · Ali Jadbabaie -
2022 Poster: Private Adaptive Optimization with Side information »
Tian Li · Manzil Zaheer · Sashank Jakkam Reddi · Virginia Smith -
2022 Poster: Robust Training of Neural Networks Using Scale Invariant Architectures »
Zhiyuan Li · Srinadh Bhojanapalli · Manzil Zaheer · Sashank Jakkam Reddi · Sanjiv Kumar -
2022 Spotlight: Private Adaptive Optimization with Side information »
Tian Li · Manzil Zaheer · Sashank Jakkam Reddi · Virginia Smith -
2022 Oral: Agnostic Learnability of Halfspaces via Logistic Loss »
Ziwei Ji · Kwangjun Ahn · Pranjal Awasthi · Satyen Kale · Stefani Karp -
2022 Oral: Robust Training of Neural Networks Using Scale Invariant Architectures »
Zhiyuan Li · Srinadh Bhojanapalli · Manzil Zaheer · Sashank Jakkam Reddi · Sanjiv Kumar -
2022 Spotlight: Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity »
Jingzhao Zhang · Hongzhou Lin · Subhro Das · Suvrit Sra · Ali Jadbabaie -
2022 Poster: Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective »
Jingzhao Zhang · Haochuan Li · Suvrit Sra · Ali Jadbabaie -
2022 Poster: Understanding the unstable convergence of gradient descent »
Kwangjun Ahn · Jingzhao Zhang · Suvrit Sra -
2022 Spotlight: Understanding the unstable convergence of gradient descent »
Kwangjun Ahn · Jingzhao Zhang · Suvrit Sra -
2022 Spotlight: Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective »
Jingzhao Zhang · Haochuan Li · Suvrit Sra · Ali Jadbabaie -
2021 Poster: Provably Efficient Algorithms for Multi-Objective Competitive RL »
Tiancheng Yu · Yi Tian · Jingzhao Zhang · Suvrit Sra -
2021 Poster: Online Learning in Unknown Markov Games »
Yi Tian · Yuanhao Wang · Tiancheng Yu · Suvrit Sra -
2021 Spotlight: Online Learning in Unknown Markov Games »
Yi Tian · Yuanhao Wang · Tiancheng Yu · Suvrit Sra -
2021 Oral: Provably Efficient Algorithms for Multi-Objective Competitive RL »
Tiancheng Yu · Yi Tian · Jingzhao Zhang · Suvrit Sra -
2021 Poster: A statistical perspective on distillation »
Aditya Menon · Ankit Singh Rawat · Sashank Jakkam Reddi · Seungyeon Kim · Sanjiv Kumar -
2021 Poster: Disentangling Sampling and Labeling Bias for Learning in Large-output Spaces »
Ankit Singh Rawat · Aditya Menon · Wittawat Jitkrittum · Sadeep Jayasumana · Felix Xinnan Yu · Sashank Jakkam Reddi · Sanjiv Kumar -
2021 Spotlight: A statistical perspective on distillation »
Aditya Menon · Ankit Singh Rawat · Sashank Jakkam Reddi · Seungyeon Kim · Sanjiv Kumar -
2021 Spotlight: Disentangling Sampling and Labeling Bias for Learning in Large-output Spaces »
Ankit Singh Rawat · Aditya Menon · Wittawat Jitkrittum · Sadeep Jayasumana · Felix Xinnan Yu · Sashank Jakkam Reddi · Sanjiv Kumar -
2021 Poster: Three Operator Splitting with a Nonconvex Loss Function »
Alp Yurtsever · Varun Mangalick · Suvrit Sra -
2021 Poster: Federated Composite Optimization »
Honglin Yuan · Manzil Zaheer · Sashank Jakkam Reddi -
2021 Spotlight: Three Operator Splitting with a Nonconvex Loss Function »
Alp Yurtsever · Varun Mangalick · Suvrit Sra -
2021 Spotlight: Federated Composite Optimization »
Honglin Yuan · Manzil Zaheer · Sashank Jakkam Reddi -
2020 Poster: Does label smoothing mitigate label noise? »
Michal Lukasik · Srinadh Bhojanapalli · Aditya Menon · Sanjiv Kumar -
2020 Poster: Low-Rank Bottleneck in Multi-head Attention Models »
Srinadh Bhojanapalli · Chulhee Yun · Ankit Singh Rawat · Sashank Jakkam Reddi · Sanjiv Kumar -
2020 Poster: Strength from Weakness: Fast Learning Using Weak Supervision »
Joshua Robinson · Stefanie Jegelka · Suvrit Sra -
2020 Poster: Complexity of Finding Stationary Points of Nonconvex Nonsmooth Functions »
Jingzhao Zhang · Hongzhou Lin · Stefanie Jegelka · Suvrit Sra · Ali Jadbabaie -
2020 Poster: Accelerating Large-Scale Inference with Anisotropic Vector Quantization »
Ruiqi Guo · Philip Sun · Erik Lindgren · Quan Geng · David Simcha · Felix Chern · Sanjiv Kumar -
2020 Poster: SCAFFOLD: Stochastic Controlled Averaging for Federated Learning »
Sai Praneeth Reddy Karimireddy · Satyen Kale · Mehryar Mohri · Sashank Jakkam Reddi · Sebastian Stich · Ananda Theertha Suresh -
2020 Poster: Federated Learning with Only Positive Labels »
Felix Xinnan Yu · Ankit Singh Rawat · Aditya Menon · Sanjiv Kumar -
2020 Poster: Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition »
Chi Jin · Tiancheng Jin · Haipeng Luo · Suvrit Sra · Tiancheng Yu -
2019 : Structured matrices for efficient deep learning »
Sanjiv Kumar -
2019 Poster: Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling »
Shanshan Wu · Alexandros Dimakis · Sujay Sanghavi · Felix Xinnan Yu · Daniel Holtmann-Rice · Dmitry Storcheus · Afshin Rostamizadeh · Sanjiv Kumar -
2019 Poster: Random Shuffling Beats SGD after Finite Epochs »
Jeff HaoChen · Suvrit Sra -
2019 Oral: Random Shuffling Beats SGD after Finite Epochs »
Jeff HaoChen · Suvrit Sra -
2019 Oral: Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling »
Shanshan Wu · Alexandros Dimakis · Sujay Sanghavi · Felix Xinnan Yu · Daniel Holtmann-Rice · Dmitry Storcheus · Afshin Rostamizadeh · Sanjiv Kumar -
2019 Poster: Conditional Gradient Methods via Stochastic Path-Integrated Differential Estimator »
Alp Yurtsever · Suvrit Sra · Volkan Cevher -
2019 Oral: Conditional Gradient Methods via Stochastic Path-Integrated Differential Estimator »
Alp Yurtsever · Suvrit Sra · Volkan Cevher -
2018 Poster: Loss Decomposition for Fast Learning in Large Output Spaces »
En-Hsu Yen · Satyen Kale · Felix Xinnan Yu · Daniel Holtmann-Rice · Sanjiv Kumar · Pradeep Ravikumar -
2018 Oral: Loss Decomposition for Fast Learning in Large Output Spaces »
En-Hsu Yen · Satyen Kale · Felix Xinnan Yu · Daniel Holtmann-Rice · Sanjiv Kumar · Pradeep Ravikumar -
2017 Poster: Robust Budget Allocation via Continuous Submodular Functions »
Matthew J Staib · Stefanie Jegelka -
2017 Poster: Stochastic Generative Hashing »
Bo Dai · Ruiqi Guo · Sanjiv Kumar · Niao He · Le Song -
2017 Talk: Robust Budget Allocation via Continuous Submodular Functions »
Matthew J Staib · Stefanie Jegelka -
2017 Talk: Stochastic Generative Hashing »
Bo Dai · Ruiqi Guo · Sanjiv Kumar · Niao He · Le Song -
2017 Poster: Distributed Mean Estimation with Limited Communication »
Ananda Theertha Suresh · Felix Xinnan Yu · Sanjiv Kumar · Brendan McMahan -
2017 Talk: Distributed Mean Estimation with Limited Communication »
Ananda Theertha Suresh · Felix Xinnan Yu · Sanjiv Kumar · Brendan McMahan