Optimization lies at the heart of many exciting developments in machine learning, statistics, and signal processing. As models become more complex and datasets grow larger, developing efficient, reliable, and provably convergent methods is one of the primary goals in these fields.
In the last few decades, much effort has been devoted to the development of first-order methods. These methods enjoy a low per-iteration cost, have optimal complexity, are easy to implement, and have proven effective for most machine learning applications. First-order methods, however, have significant limitations: (1) they require fine hyperparameter tuning, (2) they do not incorporate curvature information and are thus sensitive to ill-conditioning, and (3) they are often unable to fully exploit the power of distributed computing architectures.
Higher-order methods, such as Newton, quasi-Newton, and adaptive gradient descent methods, are extensively used in many scientific and engineering domains. At least in theory, these methods possess several attractive features: they exploit local curvature information to mitigate the effects of ill-conditioning, they avoid or diminish the need for hyperparameter tuning, and they have enough concurrency to take advantage of distributed computing environments. Researchers have even developed stochastic versions of higher-order methods that achieve speed and scalability by incorporating curvature information in an economical and judicious manner. However, higher-order methods are often “undervalued.”
This workshop will attempt to shed light on this statement. Topics of interest include, but are not limited to, second-order methods, adaptive gradient descent methods, regularization techniques, as well as techniques based on higher-order derivatives.
Sat 7:00 a.m. - 7:10 a.m.

Introductory remarks
Raghu Bollapragada
Sat 7:10 a.m. - 7:55 a.m.

Recent trends in regularization methods with adaptive accuracy requirements
(Plenary Talk)
In this talk we present regularization methods employing inexact function and derivative evaluations. They feature a very flexible adaptive mechanism for determining the inexactness allowed, at each iteration, when computing the objective function and derivatives, in order to preserve the complexity results of their exact counterparts. The complexity analysis covers arbitrary optimality order and arbitrary degree of available approximate derivatives. In most applications the accuracy requirements can be satisfied only with a certain probability. As a first step toward coping with this nondeterministic aspect, complexity results are provided for the case of exact functions and randomly perturbed derivatives. We also analyse the key computational aspects of an efficient implementation of a method of this class employing a cubic regularized second-order model, and provide numerical results showing the behaviour of the method. Authors: Stefania Bellavia, Gianmarco Gurioli, Benedetta Morini, Philippe Toint
Stefania Bellavia
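A cubic-regularized second-order model of the kind mentioned in the abstract can be illustrated with a minimal sketch. This is a generic Python illustration with exact derivatives, a fixed regularization weight, and a naive subproblem solver; it is not the authors' adaptive, inexact method:

```python
import numpy as np

def cubic_model_min(g, H, sigma, iters=200, lr=0.1):
    # Approximately minimize the cubic model
    #   m(s) = g^T s + 0.5 s^T H s + (sigma/3) ||s||^3
    # by gradient descent on m (a deliberately simple subproblem solver).
    s = np.zeros_like(g)
    for _ in range(iters):
        grad_m = g + H @ s + sigma * np.linalg.norm(s) * s
        s -= lr * grad_m
    return s

def cubic_newton(f_grad, f_hess, x0, sigma=1.0, steps=50):
    # Basic cubic-regularized Newton iteration with a fixed weight sigma.
    x = x0.copy()
    for _ in range(steps):
        x = x + cubic_model_min(f_grad(x), f_hess(x), sigma)
    return x

# Example: convex quadratic f(x) = 0.5 x^T A x - b^T x, minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = cubic_newton(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(np.allclose(x_star, np.linalg.solve(A, b), atol=1e-3))  # True
```

As the iterates approach the minimizer, the gradient shrinks, the cubic term becomes negligible, and the step approaches the pure Newton step.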
Sat 7:55 a.m. - 8:05 a.m.

Q&A with Stefania Bellavia (Plenary Speaker #1)
(Q&A)

Sat 8:05 a.m. - 8:50 a.m.

Conjugate gradient techniques for nonconvex optimization
(Plenary Talk)
Nonconvex optimization formulations have recently gained popularity due to their prevalence in deep learning applications, but they also arise in numerous problems from data analysis and robust statistics. In this setting, second-order stationary points often correspond to global optima, which motivates the development of second-order methods with fast convergence capabilities. Meanwhile, the study of acceleration phenomena, initially proposed in convex optimization, has expanded to nonconvex formulations and led to revisiting popular large-scale nonlinear optimization frameworks to equip them with complexity guarantees while maintaining their practical appeal. In this talk, we will review recent advances in employing Krylov methods to accelerate optimization procedures in a nonconvex setting. We will focus in particular on linear conjugate gradient and its variations, and we will show that this method is naturally built to detect nonconvexity on quadratic problems. We will then move to a general nonconvex nonlinear optimization setting and describe a Newton-Conjugate Gradient method with the best known complexity guarantees. Numerical illustrations of the performance of this method and extensions will be provided, with a focus on nonconvex instances in data science.
Clément Royer
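The observation that linear conjugate gradient naturally detects nonconvexity can be sketched in a few lines: each CG search direction `p` exposes the curvature `p^T A p`, and a nonpositive value certifies that `A` is not positive definite. This is a generic textbook CG loop, not the specific method from the talk:

```python
import numpy as np

def cg_with_curvature_test(A, b, tol=1e-8, maxiter=100):
    # Linear conjugate gradient on A x = b that monitors the curvature
    # p^T A p of each search direction; a nonpositive value reveals
    # nonconvexity and the offending direction is returned.
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for _ in range(maxiter):
        Ap = A @ p
        curv = p @ Ap
        if curv <= 0:
            return x, p          # p has nonpositive curvature
        alpha = (r @ r) / curv
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            return x, None       # converged; no nonconvexity detected
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x, None

# An indefinite matrix: CG detects a direction of negative curvature.
A_ind = np.diag([2.0, -1.0])
_, d = cg_with_curvature_test(A_ind, np.array([1.0, 1.0]))
print(d is not None and d @ (A_ind @ d) <= 0)  # True
```

In Newton-CG-type methods, such a direction is typically used as a descent direction of negative curvature instead of the inexact Newton step.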
Sat 8:50 a.m. - 9:00 a.m.

Q&A with Clément Royer (Plenary Speaker #2)
(Q&A)

Sat 9:00 a.m. - 9:20 a.m.

Break #1
(Break)

Sat 9:20 a.m. - 10:05 a.m.

Algorithms for Deterministically Constrained Stochastic Optimization
(Plenary Talk)
We discuss algorithms for solving optimization problems in which the objective function is defined by an expectation and the constraints are deterministic. Numerous algorithms exist for solving such problems when the constraints define a convex feasible region. By contrast, we propose algorithms for situations in which the constraints are nonlinear equations and/or inequalities that define nonconvex feasible regions. Our algorithms, for which convergence and worst-case complexity guarantees are offered in expectation, vastly outperform naive extensions of first-order methods to this constrained setting.
Frank E Curtis
Sat 10:05 a.m. - 10:15 a.m.

Q&A with Frank E. Curtis (Plenary Speaker #3)
(Q&A)

Sat 10:15 a.m. - 11:00 a.m.

SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality
(Plenary Talk)
In this talk, I will present a framework, inspired by random matrix theory, for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and the number of dimensions are large. Using this framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large-sample, large-dimension limit. Furthermore, the limiting dynamics are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which we also verify experimentally. Finally, when the input data is isotropic, we provide explicit expressions for the dynamics and average-case convergence rates. These rates show significant improvement over the worst-case complexities.
Courtney Paquette
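The regime studied in the talk is easy to reproduce numerically. Below is a generic single-sample SGD run on a random least squares problem with both the sample count and the dimension large; it is only an illustration of the setting, not the Volterra-equation model, and the problem sizes and stepsize are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 100                                  # many samples, many dimensions
A = rng.standard_normal((n, d)) / np.sqrt(n)     # random data matrix
x_true = rng.standard_normal(d)
b = A @ x_true                                   # consistent (noiseless) targets

def sgd_least_squares(step, epochs=50):
    # Single-sample SGD on f(x) = (1/2n) ||A x - b||^2.
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x -= step * (A[i] @ x - b[i]) * A[i]
    return 0.5 * np.mean((A @ x - b) ** 2)

# A stepsize below the critical threshold converges; the talk derives
# the exact limiting convergence rate as a function of the stepsize.
final_loss = sgd_least_squares(step=0.5)
print(final_loss < 1e-3)  # True
```

Rerunning with different random seeds produces nearly identical loss curves, which is the concentration phenomenon that the deterministic limiting dynamics capture.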
Sat 11:00 a.m. - 11:10 a.m.

Q&A with Courtney Paquette (Plenary Speaker #4)
(Q&A)

Sat 11:10 a.m. - 11:20 a.m.

Convergence Analysis and Implicit Regularization of Feedback Alignment for Deep Linear Networks
(Spotlight Talk)
Manuela Girotti
Sat 11:20 a.m. - 11:30 a.m.

Computing the Newton-step faster than Hessian accumulation
(Spotlight Talk)
Akshay Srinivasan
Sat 11:30 a.m. - 1:00 p.m.

Break #2
(Break)

Sat 1:00 p.m. - 1:45 p.m.

Descent method framework in optimization
(Plenary Talk)
We present a framework for going beyond first-order methods, which we refer to as the “descent method framework.” We construct a family of algorithms using intuition from dynamical systems, and show how they obtain fast non-asymptotic convergence for various classes of smooth functions. We also present a framework for “accelerating” descent algorithms, i.e., adding momentum, to obtain better theoretical complexity guarantees. We end by illustrating how this framework can be extended to manifolds and may lead to new techniques and analyses that incorporate geometric properties of various types of parameter spaces.
Ashia Wilson
Sat 1:45 p.m. - 1:55 p.m.

Q&A with Ashia Wilson (Plenary Speaker #5)
(Q&A)

Sat 1:55 p.m. - 2:40 p.m.

Faster Empirical Risk Minimization
(Plenary Talk)
Empirical Risk Minimization (ERM) problems are central to machine learning, and their efficient optimization has been studied from different perspectives, often taking advantage of the finite-sum structure present in typical problem formulations. In particular, tight oracle complexity bounds have been obtained under fairly general assumptions about the loss functions. In this talk, I will present a rather surprising and general result that takes advantage of the separability of nonsmooth convex loss functions with efficiently computable proximal operators, such as the hinge loss and the sum of absolute errors, to obtain an algorithm that exhibits significantly lower complexity than what is predicted by the lower bounds for general nonsmooth convex losses. Crucial to this result is the use of proximal operators, which goes beyond simple gradient oracle access to the function. The talk is based on joint work with Chaobing Song and Stephen Wright.
Jelena Diakonikolas
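Both losses named in the abstract have cheap closed-form proximal operators, which is the structural property the result exploits. A minimal sketch of the two operators (standard closed forms, not the algorithm from the talk):

```python
import numpy as np

def prox_abs(v, lam):
    # Proximal operator of lam * |.| : soft-thresholding,
    # argmin_z lam*|z| + 0.5*(z - v)^2.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_hinge(v, lam):
    # Proximal operator of lam * max(0, 1 - z), in closed form:
    # shift by lam below the kink, clamp to 1 near it, identity above it.
    return np.where(v >= 1.0, v, np.where(v <= 1.0 - lam, v + lam, 1.0))

print(np.allclose(prox_abs(np.array([-2.0, 0.3, 2.0]), 1.0),
                  [-1.0, 0.0, 1.0]))                          # True
print(np.allclose(prox_hinge(np.array([-1.0, 0.9, 2.0]), 0.5),
                  [-0.5, 1.0, 2.0]))                          # True
```

Because each operator is an O(1) elementwise formula, a proximal oracle for these separable losses costs no more than a gradient evaluation.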
Sat 2:40 p.m. - 2:50 p.m.

Q&A with Jelena Diakonikolas (Plenary Speaker #6)
(Q&A)

Sat 2:50 p.m. - 3:15 p.m.

Break #3
(Break)

Sat 3:15 p.m. - 3:25 p.m.

Regularized Newton Method with Global O(1/k^2) Convergence
(Spotlight Talk)
Konstantin Mishchenko
Sat 3:25 p.m. - 3:35 p.m.

Structured second-order methods via natural-gradient descent
(Spotlight Talk)
Wu Lin
Sat 3:35 p.m. - 3:45 p.m.

Implicit Regularization in Overparameterized Bilevel Optimization
(Spotlight Talk)
Paul Vicol
Sat 3:45 p.m. - 4:30 p.m.

Stochastic Variance-Reduced High-order Optimization for Nonconvex Optimization
(Plenary Talk)
High-order optimization methods, such as cubic regularization methods, have attracted great interest in recent years due to their power to better leverage the optimization landscape. To apply them to large-scale optimization in machine learning, it is of great interest to extend them to stochastic optimization. In this talk, I will introduce a stochastic variance-reduced pth-order method for finding first-order stationary points in nonconvex finite-sum optimization. Our algorithm enjoys state-of-the-art complexities under several complexity measures, including gradient and Hessian sample complexities. I will also introduce corresponding lower bound results suggesting the near-optimality of our algorithm under some specific circumstances.
Quanquan Gu
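The variance-reduction idea underlying such methods is easiest to see at first order: an SVRG-style estimator corrects a minibatch gradient with a full gradient computed at a reference point, and the same construction extends to Hessians and higher derivatives. A first-order sketch on a least-squares finite sum (a generic illustration, not the pth-order method from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(i, x):
    # Gradient of the i-th component f_i(x) = 0.5 * (a_i^T x - b_i)^2.
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / n

def svrg_gradient(x, x_ref, g_ref, batch):
    # Variance-reduced estimator: grad_B(x) - grad_B(x_ref) + full_grad(x_ref).
    # It is unbiased, and its variance vanishes as x approaches x_ref.
    diff = sum(grad_i(i, x) - grad_i(i, x_ref) for i in batch) / len(batch)
    return diff + g_ref

x_ref = rng.standard_normal(d)
g_ref = full_grad(x_ref)
# At x == x_ref the minibatch correction cancels exactly,
# so the estimator reproduces the full gradient.
est = svrg_gradient(x_ref, x_ref, g_ref, batch=rng.integers(0, n, size=5))
print(np.allclose(est, g_ref))  # True
```

The Hessian analogue replaces `grad_i` with per-sample Hessians; the shrinking variance near the reference point is what drives the improved sample complexities.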
Sat 4:30 p.m. - 4:40 p.m.

Q&A with Quanquan Gu (Plenary Speaker #7)
(Q&A)

Sat 4:40 p.m. - 4:50 p.m.

Closing remarks
Albert S Berahas
Author Information
Albert S Berahas (University of Michigan)
Anastasios Kyrillidis (Rice University)
Fred Roosta (University of Queensland)
Amir Gholaminejad (University of California, Berkeley)
Michael Mahoney (UC Berkeley)
Rachael Tappenden (University of Canterbury)
Raghu Bollapragada (The University of Texas at Austin)
Rixon Crane (The University of Queensland)
J. Lyle Kim (Rice University)