Timezone: »
Optimization lies at the heart of many exciting developments in machine learning, statistics and signal processing. As models become more complex and datasets get larger, finding efficient, reliable and provable methods is one of the primary goals in these fields.
In the last few decades, much effort has been devoted to the development of firstorder methods. These methods enjoy a low periteration cost and have optimal complexity, are easy to implement, and have proven to be effective for most machine learning applications. Firstorder methods, however, have significant limitations: (1) they require fine hyperparameter tuning, (2) they do not incorporate curvature information, and thus are sensitive to illconditioning, and (3) they are often unable to fully exploit the power of distributed computing architectures.
Higherorder methods, such as Newton, quasiNewton and adaptive gradient descent methods, are extensively used in many scientific and engineering domains. At least in theory, these methods possess several nice features: they exploit local curvature information to mitigate the effects of illconditioning, they avoid or diminish the need for hyperparameter tuning, and they have enough concurrency to take advantage of distributed computing environments. Researchers have even developed stochastic versions of higherorder methods, that feature speed and scalability by incorporating curvature information in an economical and judicious manner. However, often higherorder methods are “undervalued.”
This workshop will attempt to shed light on this statement. Topics of interest include, but are not limited to, secondorder methods, adaptive gradient descent methods, regularization techniques, as well as techniques based on higherorder derivatives.
Fri 8:00 a.m.  8:15 a.m.

Introductory notes
(Talk)
Introductory notes for the workshop 
🔗 
Fri 8:15 a.m.  9:00 a.m.

Talk by Peter Richtarik  Fast linear convergence of randomized BFGS
(Talk)
Since the late 1950's when quasiNewton methods first appeared, they have become one of the most widely used and efficient algorithmic paradigms for unconstrained optimization. Despite their immense practical success, there is little theory that shows why these methods are so efficient. We provide a semilocal rate of convergence for the randomized BFGS method which can be significantly better than that of gradient descent, finally giving theoretical evidence supporting the superior empirical performance of the method. 
Peter Richtarik 🔗 
Fri 9:00 a.m.  9:10 a.m.

Q&A with Peter Richtarik
(Discussion)

Peter Richtarik 🔗 
Fri 9:10 a.m.  9:55 a.m.

Talk by Francis Bach  Second Order Strikes Back  Globally convergent Newton methods for illconditioned generalized selfconcordant Losses
(Talk)
SlidesLive Video » We will study largescale convex optimization algorithms based on the Newton method applied to regularized generalized selfconcordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme based on a sequence of problems with decreasing regularization parameters is provably globally convergent, that this convergence is linear with a constant factor which scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling than regular firstorder methods but with an improved behavior, in particular in illconditioned problems. Second, in the non parametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order O(n\sqrt{n}), a memory complexity of order O(n) and no dependence on the condition number, generalizing the results known for leastsquares regression (Joint work with Ulysse MarteauFerey and Alessandro Rudi, https://arxiv.org/abs/1907.01771). 
Francis Bach 🔗 
Fri 9:55 a.m.  10:05 a.m.

Q&A with Francis Bach
(Discussion)

Francis Bach 🔗 
Fri 10:05 a.m.  10:30 a.m.

Break until 10:30am (PDT)
(Break)

🔗 
Fri 10:30 a.m.  10:40 a.m.

Spotlight talk 1  A SecondOrder Optimization Algorithm for Solving Problems Involving Group Sparse Regularization
(Talk)
We consider the problem of minimizing the sum of a convex function and a sparsityinducing group regularizer. Problems involving such regularizers arise in modern machine learning applications often for the purpose of obtaining models that are easier to interpret and that have higher predictive accuracy. We present a new secondorder algorithm for solving such problems that uses subspace acceleration, domain decomposition, and support identification. Our analysis shows that our algorithm has strong complexity properties, and preliminary numerical results for binary classification based on regularized logistic regression show that our approach is efficient and robust. 
Daniel Robinson 🔗 
Fri 10:40 a.m.  10:50 a.m.

Spotlight talk 2  Ridge Riding: Finding diverse solutions by following eigenvectors of the Hessian
(Talk)
Over the last decade, a single algorithm has changed many facets of our lives  Stochastic Gradient Descent (SGD). In the era of ever decreasing loss functions, SGD and its various offspring are a key component of the success of deep neural networks. However, in some cases it may matter which local optimum is found, and this is often contextdependent. Examples frequently arise in machine learning, from shapeversustexturefeatures to ensemble methods and zeroshot coordination. In these settings, there are desired types of solutions which SGD on `standard' loss functions will not find. In this paper, we present a different approach. Rather than following the a locally greedy gradient, we instead follow the eigenvectors of the Hessian. We call these 'ridges'. By iteratively following and branching amongst the ridges, we effectively span the loss surface to find qualitatively different solutions. We show both theoretically and experimentally that our method, called Ridge Riding, offers a promising direction. 
Jack ParkerHolder 🔗 
Fri 10:50 a.m.  11:00 a.m.

Spotlight talk 3  PyHessian: Neural Networks Through the Lens of the Hessian
(Talk)
SlidesLive Video » We present PyHessian, a new scalable framework that enables fast computation of Hessian (i.e., secondorder derivative) information for deep neural networks. PyHessianenables fast computation of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and supports distributedmemory execution on cloud/supercomputer systems and available as open source. We show that this framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. In particular, we analyze the effect of Batch Normalization layers on the trainability of NNs. We find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallow networks, as opposed to common belief. 
Amir Gholaminejad 🔗 
Fri 11:00 a.m.  11:45 a.m.

Talk by Coralia Cartis  Dimensionality reduction techniques for largescale optimization problems
(Talk)
Known by many names, sketching techniques allow random projections of data from high to low dimensions while preserving pairwise distances. This talk explores ways to use sketching so as to improve the scalability of algorithms for diverse classes of optimization problems and applications, from linear to nonlinear, local to global, derivativebased to derivativefree. Regression problems and GaussNewton techniques will receive particular attention. Numerical illustrations on standard optimization test problems as well as on some machine learning setups will be presented. This work is joint with Jan Fiala (NAG Ltd), Jaroslav Fowkes (Oxford), Estelle Massart (Oxford and NPL), Adilet Otemissov (Oxford and Turing), Alex Puiu (Oxford), Lindon Roberts (Australian National University, Canberra), Zhen Shao (Oxford). 
Coralia Cartis 🔗 
Fri 11:45 a.m.  11:55 a.m.

Q&A with Coralia Cartis
(Discussion)

Coralia Cartis 🔗 
Fri 11:55 a.m.  1:30 p.m.

Break until 13:30pm (PDT)
(Break)

🔗 
Fri 1:30 p.m.  1:40 p.m.

Spotlight talk 4  MomentumRNN: Integrating Momentum into Recurrent Neural Networks
(Talk)
Designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called {\em MomentumRNNs}. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. We study the momentum longshort term memory (MomentumLSTM) and verify its advantages in convergence speed and accuracy over its LSTM counterpart across a variety of benchmarks, with little compromise in computational or memory efficiency. We also demonstrate that MomentumRNN is applicable to many types of recurrent cells, including those in the stateoftheart orthogonal RNNs. Finally, we show that other advanced momentumbased optimization methods, such as Adam and Nesterov accelerated gradients with a restart, can be easily incorporated into the MomentumRNN framework for designing new recurrent cells with even better performance. 
Minh Nguyen 🔗 
Fri 1:40 p.m.  1:50 p.m.

Spotlight talk 5  Stepsize Adaptation Using Exponentiated Gradient Updates
(Talk)
SlidesLive Video » Optimizers like Adam and AdaGrad have been very successful in training largescale neural networks. Yet, the performance of these methods is heavily dependent on a carefully tuned learning rate schedule. We show that in many largescale applications, augmenting a given optimizer with an adaptive tuning method of the stepsize greatly improves the performance. More precisely, we maintain a global stepsize scale for the update as well as a gain factor for each coordinate. We adjust the global scale based on the alignment of the average gradient and the current gradient vectors. A similar approach is used for updating the local gain factors. This type of stepsize scale tuning has been done before with gradient descent updates. In this paper, we update the stepsize scale and the gain variables with exponentiated gradient updates instead. Experimentally, we show that our approach can achieve compelling accuracy on standard models without using any specially tuned learning rate schedule. We also show the effectiveness of our approach for quickly adapting to distribution shifts in the data during training. 
Ehsan Amid 🔗 
Fri 1:50 p.m.  2:00 p.m.

Spotlight talk 6  Competitive Mirror Descent
(Talk)
SlidesLive Video » Constrained competitive optimization involves multiple agents trying to minimize conflicting objectives, subject to constraints. This is a highly expressive modeling language that subsumes most of modern machine learning.In this work we propose competitive mirror descent (CMD): a general method for solving such problems based on first order information that can be obtained by automatic differentiation.First, by adding Lagrange multipliers, we obtain a simplified constraint set with an associated Bregman potential.At each iteration, we then solve for the Nash equilibrium of a regularized bilinear approximation of the full problem to obtain a direction of movement of the agents.Finally, we obtain the next iterate by following this direction according to the dual geometry induced by the Bregman potential.By using the dual geometry we obtain feasible iterates despite only solving a linear system at each iteration, eliminating the need for projection steps while still accounting for the global nonlinear structure of the constraint set.As a special case we obtain a novel competitive multiplicative weights algorithm for problems on the positive cone. 
Florian Schaefer 🔗 
Fri 2:00 p.m.  2:15 p.m.

Industry Panel  Talk by Boris Ginsburg  Large scale deep learning: new trends and optimization challenges
(Talk)
SlidesLive Video » I will discuss two major trends in the deep learning. The first trend is an exponential growth in the size of models: from 340M (BERTlarge) in 2018 to 175B (GPT3) in 2020. We need new, more memory efficient algorithms to train such huge models. The second trend is “BERTapproach”, when a model is first pretrained in unsupervised or selfsupervised manner on large unlabeled dataset, and then it is finetuned for another task using a. smaller labeled dataset. This trend sets new theoretical problems. Next, I will discuss a practical need in theoretical foundation for regularization methods used in the deep learning practice: data augmentation, dropout, label smoothing etc. Finally, I will describe an applicationdriven design of new optimization methods using NovoGrad as example. 
Boris Ginsburg 🔗 
Fri 2:15 p.m.  2:30 p.m.

Industry Panel  Talk by Jonathan Hseu  ML Models in Production
(Talk)
SlidesLive Video » We discuss the difficulties of training and improving ML models on large datasets in production. We also go into the process of an engineer working on ML models, and the challenges of trading off cost, model quality, and performance. Finally, we go into a wishlist of optimization improvements that could improve the workflow of engineers working on these models. 
Jonathan Hseu 🔗 
Fri 2:30 p.m.  2:45 p.m.

Industry Panel  Talk by Andres Rodriguez  Shifting the DL industry to 2nd order methods
(Talk)
SlidesLive Video » In this talk, we review the topology design process used by data scientists and explain why 1st order methods are computational expensive in the design process. We explore the benefits of 2nd order methods to reduce the topology design cost and highlight recent work that approximates the inverse Hessian. We conclude with recommendations to accelerate the adoption of these methods in the DL ecosystem. 
Andres Rodriguez 🔗 
Fri 2:45 p.m.  3:00 p.m.

Industry Panel  Talk by Lin Xiao  Statistical Adaptive Stochastic Gradient Methods
(Talk)
SlidesLive Video » Stochastic gradient descent (SGD) and its many variants serve as the workhorses of deep learning. One of the foremost pain points in using these methods in practice is hyperparameter tuning, especially the learning rate (step size). We propose a statistical adaptive procedure called SALSA to automatically schedule the learning rate for a broad family of stochastic gradient methods. SALSA first uses a smoothed linesearch procedure to find a good initial learning rate, then automatically switches to a statistical method, which detects stationarity of the learning process under a fixed learning rate, and drops the learning rate by a constant factor whenever stationarity is detected. The combined procedure is highly robust and autonomous, and it matches the performance of the best handtuned methods in several popular deep learning tasks. 
Lin Xiao 🔗 
Fri 3:00 p.m.  3:30 p.m.

Industry panel Q&A
(Discussion)

🔗 
Fri 3:30 p.m.  4:15 p.m.

Talk by Rachel Ward  Weighted Optimization: better generalization by smoother interpolation
(Talk)
We provide a rigorous analysis of how implicit bias towards smooth interpolations leads to low generalization error in the overparameterized setting. We provide the first case study of this connection through a random Fourier series model and weighted least squares. We then argue through this model and numerical experiments that normalization methods in deep learning such as weight normalization improve generalization in overparameterized neural networks by implicitly encouraging smooth interpolants. This is work with Yuege (Gail) Xie, Holger Rauhut, and HungHsu Chou. 
Rachel Ward 🔗 
Fri 4:15 p.m.  4:25 p.m.

Q&A with Rachel Ward
(Discussion)

Rachel Ward 🔗 
Fri 4:25 p.m.  5:00 p.m.

Break until 17:00pm (PDT)
(Break)

🔗 
Fri 5:00 p.m.  5:45 p.m.

Talk by Rio Yokota  Degree of Approximation and Overhead of Computing Curvature, Information, and Noise Matrices
(Talk)
SlidesLive Video » Hessian, Fisher, and Covariance matrices are not only used for preconditioning optimizers, but also in generalization metrics, predicting hyperparameters, and Bayesian inference. These matrices contain valuable information that can advance theory in statistical learning, but they are very expensive to compute exactly for modern deep neural networks with billions of parameters. We make use of a highly optimized implementation for computing these matrices with various degrees of approximation to close the gap between theory and practice. We are able to significantly reduce the overhead of computing these matrices through a hybrid dataparallel + modelparallel approach. 
Rio Yokota 🔗 
Fri 5:45 p.m.  5:55 p.m.

Q&A with Rio Yokota
(Discussion)

Rio Yokota 🔗 
Fri 5:55 p.m.  6:00 p.m.

Closing remarks
(Discussion)

🔗 
Author Information
Albert S Berahas (Lehigh University)
Amir Gholaminejad (University of California, Berkeley)
Anastasios Kyrillidis (Rice University)
Michael Mahoney (UC Berkeley)
Fred Roosta (University of Queensland)
More from the Same Authors

2021 : Mitigating deep double descent by concatenating inputs »
John Chen · Qihan Wang · Anastasios Kyrillidis 
2022 Poster: Generalization Bounds using Lower Tail Exponents in Stochastic Optimizers »
Liam Hodgkinson · Umut Simsekli · Rajiv Khanna · Michael Mahoney 
2022 Spotlight: Generalization Bounds using Lower Tail Exponents in Stochastic Optimizers »
Liam Hodgkinson · Umut Simsekli · Rajiv Khanna · Michael Mahoney 
2022 Poster: AutoIP: A United Framework to Integrate Physics into Gaussian Processes »
Da Long · Zheng Wang · Aditi Krishnapriyan · Robert Kirby · Shandian Zhe · Michael Mahoney 
2022 Spotlight: AutoIP: A United Framework to Integrate Physics into Gaussian Processes »
Da Long · Zheng Wang · Aditi Krishnapriyan · Robert Kirby · Shandian Zhe · Michael Mahoney 
2022 Poster: Fat–Tailed Variational Inference with Anisotropic Tail Adaptive Flows »
Feynman Liang · Michael Mahoney · Liam Hodgkinson 
2022 Spotlight: Fat–Tailed Variational Inference with Anisotropic Tail Adaptive Flows »
Feynman Liang · Michael Mahoney · Liam Hodgkinson 
2022 Poster: GACT: Activation Compressed Training for Generic Network Architectures »
Xiaoxuan Liu · Lianmin Zheng · Dequan Wang · Yukuo Cen · Weize Chen · Xu Han · Jianfei Chen · Zhiyuan Liu · Jie Tang · Joseph Gonzalez · Michael Mahoney · Alvin Cheung 
2022 Spotlight: GACT: Activation Compressed Training for Generic Network Architectures »
Xiaoxuan Liu · Lianmin Zheng · Dequan Wang · Yukuo Cen · Weize Chen · Xu Han · Jianfei Chen · Zhiyuan Liu · Jie Tang · Joseph Gonzalez · Michael Mahoney · Alvin Cheung 
2022 Poster: Neurotoxin: Durable Backdoors in Federated Learning »
Zhengming Zhang · Ashwinee Panda · Linyue Song · Yaoqing Yang · Michael Mahoney · Prateek Mittal · Kannan Ramchandran · Joseph E Gonzalez 
2022 Spotlight: Neurotoxin: Durable Backdoors in Federated Learning »
Zhengming Zhang · Ashwinee Panda · Linyue Song · Yaoqing Yang · Michael Mahoney · Prateek Mittal · Kannan Ramchandran · Joseph E Gonzalez 
2021 Workshop: Beyond firstorder methods in machine learning systems »
Albert S Berahas · Anastasios Kyrillidis · Fred Roosta · Amir Gholaminejad · Michael Mahoney · Rachael Tappenden · Raghu Bollapragada · Rixon Crane · J. Lyle Kim 
2021 Poster: IBERT: Integeronly BERT Quantization »
Sehoon Kim · Amir Gholaminejad · Zhewei Yao · Michael Mahoney · EECS Kurt Keutzer 
2021 Poster: HAWQV3: Dyadic Neural Network Quantization »
Zhewei Yao · Zhen Dong · Zhangcheng Zheng · Amir Gholaminejad · Jiali Yu · Eric Tan · Leyuan Wang · Qijing Huang · Yida Wang · Michael Mahoney · EECS Kurt Keutzer 
2021 Spotlight: HAWQV3: Dyadic Neural Network Quantization »
Zhewei Yao · Zhen Dong · Zhangcheng Zheng · Amir Gholaminejad · Jiali Yu · Eric Tan · Leyuan Wang · Qijing Huang · Yida Wang · Michael Mahoney · EECS Kurt Keutzer 
2021 Oral: IBERT: Integeronly BERT Quantization »
Sehoon Kim · Amir Gholaminejad · Zhewei Yao · Michael Mahoney · EECS Kurt Keutzer 
2021 Poster: ActNN: Reducing Training Memory Footprint via 2Bit Activation Compressed Training »
Jianfei Chen · Lianmin Zheng · Zhewei Yao · Dequan Wang · Ion Stoica · Michael Mahoney · Joseph E Gonzalez 
2021 Oral: ActNN: Reducing Training Memory Footprint via 2Bit Activation Compressed Training »
Jianfei Chen · Lianmin Zheng · Zhewei Yao · Dequan Wang · Ion Stoica · Michael Mahoney · Joseph E Gonzalez 
2021 Poster: Multiplicative Noise and Heavy Tails in Stochastic Optimization »
Liam Hodgkinson · Michael Mahoney 
2021 Spotlight: Multiplicative Noise and Heavy Tails in Stochastic Optimization »
Liam Hodgkinson · Michael Mahoney 
2020 : Determinantal Point Processes in Randomized Numerical Linear Algebra »
Michael Mahoney 
2020 : Spotlight talk 3  PyHessian: Neural Networks Through the Lens of the Hessian »
Amir Gholaminejad 
2020 Poster: Forecasting Sequential Data Using Consistent Koopman Autoencoders »
Omri Azencot · N. Benjamin Erichson · Vanessa Lin · Michael Mahoney 
2020 Poster: PowerNorm: Rethinking Batch Normalization in Transformers »
Sheng Shen · Zhewei Yao · Amir Gholaminejad · Michael Mahoney · Kurt Keutzer 
2020 Poster: DINO: Distributed NewtonType Optimization Method »
Rixon Crane · Fred Roosta 
2020 Poster: Negative Sampling in SemiSupervised learning »
John Chen · Vatsal Shah · Anastasios Kyrillidis 
2020 Poster: Error Estimation for Sketched SVD via the Bootstrap »
Miles Lopes · N. Benjamin Erichson · Michael Mahoney 
2019 : Poster discussion »
Roman Novak · Maxime Gabella · Frederic Dreyer · Siavash Golkar · Anh Tong · Irina Higgins · Mirco Milletari · Joe Antognini · Sebastian Goldt · Adín Ramírez Rivera · Roberto Bondesan · Ryo Karakida · Remi Tachet des Combes · Michael Mahoney · Nicholas Walker · Stanislav Fort · Samuel Smith · Rohan Ghosh · Aristide Baratin · Diego Granziol · Stephen Roberts · Dmitry Vetrov · Andrew Wilson · César Laurent · Valentin Thomas · Simon LacosteJulien · Dar Gilboa · Daniel Soudry · Anupam Gupta · Anirudh Goyal · Yoshua Bengio · Erich Elsen · Soham De · Stanislaw Jastrzebski · Charles H Martin · Samira Shabanian · Aaron Courville · Shorato Akaho · Lenka Zdeborova · Ethan Dyer · Maurice Weiler · Pim de Haan · Taco Cohen · Max Welling · Ping Luo · zhanglin peng · Nasim Rahaman · Loic Matthey · Danilo J. Rezende · Jaesik Choi · Kyle Cranmer · Lechao Xiao · Jaehoon Lee · Yasaman Bahri · Jeffrey Pennington · Greg Yang · Jiri Hron · Jascha SohlDickstein · Guy GurAri 
2019 : Why Deep Learning Works: Traditional and HeavyTailed Implicit SelfRegularization in Deep Neural Networks »
Michael Mahoney 
2019 Poster: Traditional and Heavy Tailed Self Regularization in Neural Network Models »
Michael Mahoney · Charles H Martin 
2019 Oral: Traditional and Heavy Tailed Self Regularization in Neural Network Models »
Michael Mahoney · Charles H Martin 
2019 Poster: Compressing Gradient Optimizers via CountSketches »
Ryan Spring · Anastasios Kyrillidis · Vijai Mohan · Anshumali Shrivastava 
2019 Oral: Compressing Gradient Optimizers via CountSketches »
Ryan Spring · Anastasios Kyrillidis · Vijai Mohan · Anshumali Shrivastava 
2018 Poster: Outofsample extension of graph adjacency spectral embedding »
Keith Levin · Fred Roosta · Michael Mahoney · Carey Priebe 
2018 Oral: Outofsample extension of graph adjacency spectral embedding »
Keith Levin · Fred Roosta · Michael Mahoney · Carey Priebe 
2018 Poster: Error Estimation for Randomized LeastSquares Algorithms via the Bootstrap »
Miles Lopes · Shusen Wang · Michael Mahoney 
2018 Oral: Error Estimation for Randomized LeastSquares Algorithms via the Bootstrap »
Miles Lopes · Shusen Wang · Michael Mahoney 
2017 Poster: Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging »
Shusen Wang · Alex Gittens · Michael Mahoney 
2017 Poster: Capacity Releasing Diffusion for Speed and Locality. »
Di Wang · Kimon Fountoulakis · Monika Henzinger · Michael Mahoney · Satish Rao 
2017 Talk: Capacity Releasing Diffusion for Speed and Locality. »
Di Wang · Kimon Fountoulakis · Monika Henzinger · Michael Mahoney · Satish Rao 
2017 Talk: Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging »
Shusen Wang · Alex Gittens · Michael Mahoney