Workshop
Beyond first order methods in machine learning systems
Albert S Berahas · Amir Gholaminejad · Anastasios Kyrillidis · Michael Mahoney · Fred Roosta
Keywords: Optimization Second Order Methods Stochastic Gradient Descent
Optimization lies at the heart of many exciting developments in machine learning, statistics and signal processing. As models become more complex and datasets get larger, finding efficient, reliable and provable methods is one of the primary goals in these fields.
In the last few decades, much effort has been devoted to the development of firstorder methods. These methods enjoy a low periteration cost and have optimal complexity, are easy to implement, and have proven to be effective for most machine learning applications. Firstorder methods, however, have significant limitations: (1) they require fine hyperparameter tuning, (2) they do not incorporate curvature information, and thus are sensitive to illconditioning, and (3) they are often unable to fully exploit the power of distributed computing architectures.
Higherorder methods, such as Newton, quasiNewton and adaptive gradient descent methods, are extensively used in many scientific and engineering domains. At least in theory, these methods possess several nice features: they exploit local curvature information to mitigate the effects of illconditioning, they avoid or diminish the need for hyperparameter tuning, and they have enough concurrency to take advantage of distributed computing environments. Researchers have even developed stochastic versions of higherorder methods, that feature speed and scalability by incorporating curvature information in an economical and judicious manner. However, often higherorder methods are “undervalued.”
This workshop will attempt to shed light on this statement. Topics of interest include, but are not limited to, secondorder methods, adaptive gradient descent methods, regularization techniques, as well as techniques based on higherorder derivatives.
Schedule
Fri 8:00 a.m.  8:15 a.m.

Introductory notes
(
Talk
)
Introductory notes for the workshop 
🔗 
Fri 8:15 a.m.  9:00 a.m.

Talk by Peter Richtarik  Fast linear convergence of randomized BFGS
(
Talk
)
Since the late 1950's when quasiNewton methods first appeared, they have become one of the most widely used and efficient algorithmic paradigms for unconstrained optimization. Despite their immense practical success, there is little theory that shows why these methods are so efficient. We provide a semilocal rate of convergence for the randomized BFGS method which can be significantly better than that of gradient descent, finally giving theoretical evidence supporting the superior empirical performance of the method. 
Peter Richtarik 🔗 
Fri 9:00 a.m.  9:10 a.m.

Q&A with Peter Richtarik
(
Discussion
)

Peter Richtarik 🔗 
Fri 9:10 a.m.  9:55 a.m.

Talk by Francis Bach  Second Order Strikes Back  Globally convergent Newton methods for illconditioned generalized selfconcordant Losses
(
Talk
)
SlidesLive Video We will study largescale convex optimization algorithms based on the Newton method applied to regularized generalized selfconcordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme based on a sequence of problems with decreasing regularization parameters is provably globally convergent, that this convergence is linear with a constant factor which scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling than regular firstorder methods but with an improved behavior, in particular in illconditioned problems. Second, in the non parametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order O(n\sqrt{n}), a memory complexity of order O(n) and no dependence on the condition number, generalizing the results known for leastsquares regression (Joint work with Ulysse MarteauFerey and Alessandro Rudi, https://arxiv.org/abs/1907.01771). 
Francis Bach 🔗 
Fri 9:55 a.m.  10:05 a.m.

Q&A with Francis Bach
(
Discussion
)

Francis Bach 🔗 
Fri 10:05 a.m.  10:30 a.m.

Break until 10:30am (PDT)
(
Break
)

🔗 
Fri 10:30 a.m.  10:40 a.m.

Spotlight talk 1  A SecondOrder Optimization Algorithm for Solving Problems Involving Group Sparse Regularization
(
Talk
)
We consider the problem of minimizing the sum of a convex function and a sparsityinducing group regularizer. Problems involving such regularizers arise in modern machine learning applications often for the purpose of obtaining models that are easier to interpret and that have higher predictive accuracy. We present a new secondorder algorithm for solving such problems that uses subspace acceleration, domain decomposition, and support identification. Our analysis shows that our algorithm has strong complexity properties, and preliminary numerical results for binary classification based on regularized logistic regression show that our approach is efficient and robust. 
Daniel Robinson 🔗 
Fri 10:40 a.m.  10:50 a.m.

Spotlight talk 2  Ridge Riding: Finding diverse solutions by following eigenvectors of the Hessian
(
Talk
)
Over the last decade, a single algorithm has changed many facets of our lives  Stochastic Gradient Descent (SGD). In the era of ever decreasing loss functions, SGD and its various offspring are a key component of the success of deep neural networks. However, in some cases it may matter which local optimum is found, and this is often contextdependent. Examples frequently arise in machine learning, from shapeversustexturefeatures to ensemble methods and zeroshot coordination. In these settings, there are desired types of solutions which SGD on `standard' loss functions will not find. In this paper, we present a different approach. Rather than following the a locally greedy gradient, we instead follow the eigenvectors of the Hessian. We call these 'ridges'. By iteratively following and branching amongst the ridges, we effectively span the loss surface to find qualitatively different solutions. We show both theoretically and experimentally that our method, called Ridge Riding, offers a promising direction. 
Jack ParkerHolder 🔗 
Fri 10:50 a.m.  11:00 a.m.

Spotlight talk 3  PyHessian: Neural Networks Through the Lens of the Hessian
(
Talk
)
SlidesLive Video We present PyHessian, a new scalable framework that enables fast computation of Hessian (i.e., secondorder derivative) information for deep neural networks. PyHessianenables fast computation of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and supports distributedmemory execution on cloud/supercomputer systems and available as open source. We show that this framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. In particular, we analyze the effect of Batch Normalization layers on the trainability of NNs. We find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallow networks, as opposed to common belief. 
Amir Gholaminejad 🔗 
Fri 11:00 a.m.  11:45 a.m.

Talk by Coralia Cartis  Dimensionality reduction techniques for largescale optimization problems
(
Talk
)
Known by many names, sketching techniques allow random projections of data from high to low dimensions while preserving pairwise distances. This talk explores ways to use sketching so as to improve the scalability of algorithms for diverse classes of optimization problems and applications, from linear to nonlinear, local to global, derivativebased to derivativefree. Regression problems and GaussNewton techniques will receive particular attention. Numerical illustrations on standard optimization test problems as well as on some machine learning setups will be presented. This work is joint with Jan Fiala (NAG Ltd), Jaroslav Fowkes (Oxford), Estelle Massart (Oxford and NPL), Adilet Otemissov (Oxford and Turing), Alex Puiu (Oxford), Lindon Roberts (Australian National University, Canberra), Zhen Shao (Oxford). 
Coralia Cartis 🔗 
Fri 11:45 a.m.  11:55 a.m.

Q&A with Coralia Cartis
(
Discussion
)

Coralia Cartis 🔗 
Fri 11:55 a.m.  1:30 p.m.

Break until 13:30pm (PDT)
(
Break
)

🔗 
Fri 1:30 p.m.  1:40 p.m.

Spotlight talk 4  MomentumRNN: Integrating Momentum into Recurrent Neural Networks
(
Talk
)
Designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called {\em MomentumRNNs}. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. We study the momentum longshort term memory (MomentumLSTM) and verify its advantages in convergence speed and accuracy over its LSTM counterpart across a variety of benchmarks, with little compromise in computational or memory efficiency. We also demonstrate that MomentumRNN is applicable to many types of recurrent cells, including those in the stateoftheart orthogonal RNNs. Finally, we show that other advanced momentumbased optimization methods, such as Adam and Nesterov accelerated gradients with a restart, can be easily incorporated into the MomentumRNN framework for designing new recurrent cells with even better performance. 
Minh Nguyen 🔗 
Fri 1:40 p.m.  1:50 p.m.

Spotlight talk 5  Stepsize Adaptation Using Exponentiated Gradient Updates
(
Talk
)
SlidesLive Video Optimizers like Adam and AdaGrad have been very successful in training largescale neural networks. Yet, the performance of these methods is heavily dependent on a carefully tuned learning rate schedule. We show that in many largescale applications, augmenting a given optimizer with an adaptive tuning method of the stepsize greatly improves the performance. More precisely, we maintain a global stepsize scale for the update as well as a gain factor for each coordinate. We adjust the global scale based on the alignment of the average gradient and the current gradient vectors. A similar approach is used for updating the local gain factors. This type of stepsize scale tuning has been done before with gradient descent updates. In this paper, we update the stepsize scale and the gain variables with exponentiated gradient updates instead. Experimentally, we show that our approach can achieve compelling accuracy on standard models without using any specially tuned learning rate schedule. We also show the effectiveness of our approach for quickly adapting to distribution shifts in the data during training. 
Ehsan Amid 🔗 
Fri 1:50 p.m.  2:00 p.m.

Spotlight talk 6  Competitive Mirror Descent
(
Talk
)
SlidesLive Video Constrained competitive optimization involves multiple agents trying to minimize conflicting objectives, subject to constraints. This is a highly expressive modeling language that subsumes most of modern machine learning.In this work we propose competitive mirror descent (CMD): a general method for solving such problems based on first order information that can be obtained by automatic differentiation.First, by adding Lagrange multipliers, we obtain a simplified constraint set with an associated Bregman potential.At each iteration, we then solve for the Nash equilibrium of a regularized bilinear approximation of the full problem to obtain a direction of movement of the agents.Finally, we obtain the next iterate by following this direction according to the dual geometry induced by the Bregman potential.By using the dual geometry we obtain feasible iterates despite only solving a linear system at each iteration, eliminating the need for projection steps while still accounting for the global nonlinear structure of the constraint set.As a special case we obtain a novel competitive multiplicative weights algorithm for problems on the positive cone. 
Florian Schäfer 🔗 
Fri 2:00 p.m.  2:15 p.m.

Industry Panel  Talk by Boris Ginsburg  Large scale deep learning: new trends and optimization challenges
(
Talk
)
SlidesLive Video I will discuss two major trends in the deep learning. The first trend is an exponential growth in the size of models: from 340M (BERTlarge) in 2018 to 175B (GPT3) in 2020. We need new, more memory efficient algorithms to train such huge models. The second trend is “BERTapproach”, when a model is first pretrained in unsupervised or selfsupervised manner on large unlabeled dataset, and then it is finetuned for another task using a. smaller labeled dataset. This trend sets new theoretical problems. Next, I will discuss a practical need in theoretical foundation for regularization methods used in the deep learning practice: data augmentation, dropout, label smoothing etc. Finally, I will describe an applicationdriven design of new optimization methods using NovoGrad as example. 
Boris Ginsburg 🔗 
Fri 2:15 p.m.  2:30 p.m.

Industry Panel  Talk by Jonathan Hseu  ML Models in Production
(
Talk
)
SlidesLive Video We discuss the difficulties of training and improving ML models on large datasets in production. We also go into the process of an engineer working on ML models, and the challenges of trading off cost, model quality, and performance. Finally, we go into a wishlist of optimization improvements that could improve the workflow of engineers working on these models. 
Jonathan Hseu 🔗 
Fri 2:30 p.m.  2:45 p.m.

Industry Panel  Talk by Andres Rodriguez  Shifting the DL industry to 2nd order methods
(
Talk
)
SlidesLive Video In this talk, we review the topology design process used by data scientists and explain why 1st order methods are computational expensive in the design process. We explore the benefits of 2nd order methods to reduce the topology design cost and highlight recent work that approximates the inverse Hessian. We conclude with recommendations to accelerate the adoption of these methods in the DL ecosystem. 
Andres Rodriguez 🔗 
Fri 2:45 p.m.  3:00 p.m.

Industry Panel  Talk by Lin Xiao  Statistical Adaptive Stochastic Gradient Methods
(
Talk
)
SlidesLive Video Stochastic gradient descent (SGD) and its many variants serve as the workhorses of deep learning. One of the foremost pain points in using these methods in practice is hyperparameter tuning, especially the learning rate (step size). We propose a statistical adaptive procedure called SALSA to automatically schedule the learning rate for a broad family of stochastic gradient methods. SALSA first uses a smoothed linesearch procedure to find a good initial learning rate, then automatically switches to a statistical method, which detects stationarity of the learning process under a fixed learning rate, and drops the learning rate by a constant factor whenever stationarity is detected. The combined procedure is highly robust and autonomous, and it matches the performance of the best handtuned methods in several popular deep learning tasks. 
Lin Xiao 🔗 
Fri 3:00 p.m.  3:30 p.m.

Industry panel Q&A
(
Discussion
)

🔗 
Fri 3:30 p.m.  4:15 p.m.

Talk by Rachel Ward  Weighted Optimization: better generalization by smoother interpolation
(
Talk
)
We provide a rigorous analysis of how implicit bias towards smooth interpolations leads to low generalization error in the overparameterized setting. We provide the first case study of this connection through a random Fourier series model and weighted least squares. We then argue through this model and numerical experiments that normalization methods in deep learning such as weight normalization improve generalization in overparameterized neural networks by implicitly encouraging smooth interpolants. This is work with Yuege (Gail) Xie, Holger Rauhut, and HungHsu Chou. 
Rachel Ward 🔗 
Fri 4:15 p.m.  4:25 p.m.

Q&A with Rachel Ward
(
Discussion
)

Rachel Ward 🔗 
Fri 4:25 p.m.  5:00 p.m.

Break until 17:00pm (PDT)
(
Break
)

🔗 
Fri 5:00 p.m.  5:45 p.m.

Talk by Rio Yokota  Degree of Approximation and Overhead of Computing Curvature, Information, and Noise Matrices
(
Talk
)
SlidesLive Video Hessian, Fisher, and Covariance matrices are not only used for preconditioning optimizers, but also in generalization metrics, predicting hyperparameters, and Bayesian inference. These matrices contain valuable information that can advance theory in statistical learning, but they are very expensive to compute exactly for modern deep neural networks with billions of parameters. We make use of a highly optimized implementation for computing these matrices with various degrees of approximation to close the gap between theory and practice. We are able to significantly reduce the overhead of computing these matrices through a hybrid dataparallel + modelparallel approach. 
Rio Yokota 🔗 
Fri 5:45 p.m.  5:55 p.m.

Q&A with Rio Yokota
(
Discussion
)

Rio Yokota 🔗 
Fri 5:55 p.m.  6:00 p.m.

Closing remarks
(
Discussion
)

🔗 