Beyond first order methods in machine learning systems

Albert S Berahas, Amir Gholaminejad, Tasos Kyrillidis, Michael Mahoney, Fred Roosta

Keywords:  Optimization    Second Order Methods    Stochastic Gradient Descent  


Optimization lies at the heart of many exciting developments in machine learning, statistics and signal processing. As models become more complex and datasets get larger, finding efficient, reliable and provable methods is one of the primary goals in these fields.

In the last few decades, much effort has been devoted to the development of first-order methods. These methods enjoy a low per-iteration cost and have optimal complexity, are easy to implement, and have proven to be effective for most machine learning applications. First-order methods, however, have significant limitations: (1) they require fine hyper-parameter tuning, (2) they do not incorporate curvature information, and thus are sensitive to ill-conditioning, and (3) they are often unable to fully exploit the power of distributed computing architectures.

Higher-order methods, such as Newton, quasi-Newton and adaptive gradient descent methods, are extensively used in many scientific and engineering domains. At least in theory, these methods possess several nice features: they exploit local curvature information to mitigate the effects of ill-conditioning, they avoid or diminish the need for hyper-parameter tuning, and they have enough concurrency to take advantage of distributed computing environments. Researchers have even developed stochastic versions of higher-order methods, that feature speed and scalability by incorporating curvature information in an economical and judicious manner. However, often higher-order methods are “undervalued.”

This workshop will attempt to shed light on this statement. Topics of interest include, but are not limited to, second-order methods, adaptive gradient descent methods, regularization techniques, as well as techniques based on higher-order derivatives.

Chat is not available.

Timezone: »


Fri 8:00 a.m. - 8:15 a.m.

Introductory notes for the workshop

Fri 8:15 a.m. - 9:00 a.m.

Since the late 1950's when quasi-Newton methods first appeared, they have become one of the most widely used and efficient algorithmic paradigms for unconstrained optimization. Despite their immense practical success, there is little theory that shows why these methods are so efficient. We provide a semi-local rate of convergence for the randomized BFGS method which can be significantly better than that of gradient descent, finally giving theoretical evidence supporting the superior empirical performance of the method.

Peter Richtarik
Fri 9:00 a.m. - 9:10 a.m.
Q&A with Peter Richtarik (Discussion)
Peter Richtarik
Fri 9:10 a.m. - 9:55 a.m.

We will study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme based on a sequence of problems with decreasing regularization parameters is provably globally convergent, that this convergence is linear with a constant factor which scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling than regular first-order methods but with an improved behavior, in particular in ill-conditioned problems. Second, in the non parametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order O(n\sqrt{n}), a memory complexity of order O(n) and no dependence on the condition number, generalizing the results known for least-squares regression (Joint work with Ulysse Marteau-Ferey and Alessandro Rudi,

Francis Bach
Fri 9:55 a.m. - 10:05 a.m.
Q&A with Francis Bach (Discussion)
Francis Bach
Fri 10:05 a.m. - 10:30 a.m.
Break until 10:30am (PDT) (Break)
Fri 10:30 a.m. - 10:40 a.m.

We consider the problem of minimizing the sum of a convex function and a sparsity-inducing group regularizer. Problems involving such regularizers arise in modern machine learning applications often for the purpose of obtaining models that are easier to interpret and that have higher predictive accuracy. We present a new second-order algorithm for solving such problems that uses subspace acceleration, domain decomposition, and support identification. Our analysis shows that our algorithm has strong complexity properties, and preliminary numerical results for binary classification based on regularized logistic regression show that our approach is efficient and robust.

Daniel Robinson
Fri 10:40 a.m. - 10:50 a.m.

Over the last decade, a single algorithm has changed many facets of our lives - Stochastic Gradient Descent (SGD). In the era of ever decreasing loss functions, SGD and its various offspring are a key component of the success of deep neural networks. However, in some cases it may matter which local optimum is found, and this is often context-dependent. Examples frequently arise in machine learning, from shape-versus-texture-features to ensemble methods and zero-shot coordination. In these settings, there are desired types of solutions which SGD on `standard' loss functions will not find. In this paper, we present a different approach. Rather than following the a locally greedy gradient, we instead follow the eigenvectors of the Hessian. We call these 'ridges'. By iteratively following and branching amongst the ridges, we effectively span the loss surface to find qualitatively different solutions. We show both theoretically and experimentally that our method, called Ridge Riding, offers a promising direction.

Jack Parker-Holder
Fri 10:50 a.m. - 11:00 a.m.

We present PyHessian, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PyHessianenables fast computation of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and supports distributed-memory execution on cloud/supercomputer systems and available as open source. We show that this framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. In particular, we analyze the effect of Batch Normalization layers on the trainability of NNs. We find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallow networks, as opposed to common belief.

Amir Gholaminejad
Fri 11:00 a.m. - 11:45 a.m.

Known by many names, sketching techniques allow random projections of data from high to low dimensions while preserving pairwise distances. This talk explores ways to use sketching so as to improve the scalability of algorithms for diverse classes of optimization problems and applications, from linear to nonlinear, local to global, derivative-based to derivative-free. Regression problems and Gauss-Newton techniques will receive particular attention. Numerical illustrations on standard optimization test problems as well as on some machine learning set-ups will be presented. This work is joint with Jan Fiala (NAG Ltd), Jaroslav Fowkes (Oxford), Estelle Massart (Oxford and NPL), Adilet Otemissov (Oxford and Turing), Alex Puiu (Oxford), Lindon Roberts (Australian National University, Canberra), Zhen Shao (Oxford).

Coralia Cartis
Fri 11:45 a.m. - 11:55 a.m.
Q&A with Coralia Cartis (Discussion)
Coralia Cartis
Fri 11:55 a.m. - 1:30 p.m.
Break until 13:30pm (PDT) (Break)
Fri 1:30 p.m. - 1:40 p.m.

Designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called {\em MomentumRNNs}. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. We study the momentum long-short term memory (MomentumLSTM) and verify its advantages in convergence speed and accuracy over its LSTM counterpart across a variety of benchmarks, with little compromise in computational or memory efficiency. We also demonstrate that MomentumRNN is applicable to many types of recurrent cells, including those in the state-of-the-art orthogonal RNNs. Finally, we show that other advanced momentum-based optimization methods, such as Adam and Nesterov accelerated gradients with a restart, can be easily incorporated into the MomentumRNN framework for designing new recurrent cells with even better performance.

Fri 1:40 p.m. - 1:50 p.m.

Optimizers like Adam and AdaGrad have been very successful in training large-scale neural networks. Yet, the performance of these methods is heavily dependent on a carefully tuned learning rate schedule. We show that in many large-scale applications, augmenting a given optimizer with an adaptive tuning method of the step-size greatly improves the performance. More precisely, we maintain a global step-size scale for the update as well as a gain factor for each coordinate. We adjust the global scale based on the alignment of the average gradient and the current gradient vectors. A similar approach is used for updating the local gain factors. This type of step-size scale tuning has been done before with gradient descent updates. In this paper, we update the step-size scale and the gain variables with exponentiated gradient updates instead. Experimentally, we show that our approach can achieve compelling accuracy on standard models without using any specially tuned learning rate schedule. We also show the effectiveness of our approach for quickly adapting to distribution shifts in the data during training.

Ehsan Amid
Fri 1:50 p.m. - 2:00 p.m.

Constrained competitive optimization involves multiple agents trying to minimize conflicting objectives, subject to constraints. This is a highly expressive modeling language that subsumes most of modern machine learning.In this work we propose competitive mirror descent (CMD): a general method for solving such problems based on first order information that can be obtained by automatic differentiation.First, by adding Lagrange multipliers, we obtain a simplified constraint set with an associated Bregman potential.At each iteration, we then solve for the Nash equilibrium of a regularized bilinear approximation of the full problem to obtain a direction of movement of the agents.Finally, we obtain the next iterate by following this direction according to the dual geometry induced by the Bregman potential.By using the dual geometry we obtain feasible iterates despite only solving a linear system at each iteration, eliminating the need for projection steps while still accounting for the global nonlinear structure of the constraint set.As a special case we obtain a novel competitive multiplicative weights algorithm for problems on the positive cone.

Florian Schaefer
Fri 2:00 p.m. - 2:15 p.m.

I will discuss two major trends in the deep learning. The first trend is an exponential growth in the size of models: from 340M (BERT-large) in 2018 to 175B (GPT3) in 2020. We need new, more memory efficient algorithms to train such huge models. The second trend is “BERT-approach”, when a model is first pre-trained in unsupervised or self-supervised manner on large unlabeled dataset, and then it is fine-tuned for another task using a. smaller labeled dataset. This trend sets new theoretical problems. Next, I will discuss a practical need in theoretical foundation for regularization methods used in the deep learning practice: data augmentation, dropout, label smoothing etc. Finally, I will describe an application-driven design of new optimization methods using NovoGrad as example.

Boris Ginsburg
Fri 2:15 p.m. - 2:30 p.m.

We discuss the difficulties of training and improving ML models on large datasets in production. We also go into the process of an engineer working on ML models, and the challenges of trading off cost, model quality, and performance. Finally, we go into a wishlist of optimization improvements that could improve the workflow of engineers working on these models.

Jonathan Hseu
Fri 2:30 p.m. - 2:45 p.m.

In this talk, we review the topology design process used by data scientists and explain why 1st order methods are computational expensive in the design process. We explore the benefits of 2nd order methods to reduce the topology design cost and highlight recent work that approximates the inverse Hessian. We conclude with recommendations to accelerate the adoption of these methods in the DL ecosystem.

Andres Rodriguez
Fri 2:45 p.m. - 3:00 p.m.

Stochastic gradient descent (SGD) and its many variants serve as the workhorses of deep learning. One of the foremost pain points in using these methods in practice is hyperparameter tuning, especially the learning rate (step size). We propose a statistical adaptive procedure called SALSA to automatically schedule the learning rate for a broad family of stochastic gradient methods. SALSA first uses a smoothed line-search procedure to find a good initial learning rate, then automatically switches to a statistical method, which detects stationarity of the learning process under a fixed learning rate, and drops the learning rate by a constant factor whenever stationarity is detected. The combined procedure is highly robust and autonomous, and it matches the performance of the best hand-tuned methods in several popular deep learning tasks.

Lin Xiao
Fri 3:00 p.m. - 3:30 p.m.
Industry panel Q&A (Discussion)
Fri 3:30 p.m. - 4:15 p.m.

We provide a rigorous analysis of how implicit bias towards smooth interpolations leads to low generalization error in the overparameterized setting. We provide the first case study of this connection through a random Fourier series model and weighted least squares. We then argue through this model and numerical experiments that normalization methods in deep learning such as weight normalization improve generalization in overparameterized neural networks by implicitly encouraging smooth interpolants. This is work with Yuege (Gail) Xie, Holger Rauhut, and Hung-Hsu Chou.

Rachel Ward
Fri 4:15 p.m. - 4:25 p.m.
Q&A with Rachel Ward (Discussion)
Rachel Ward
Fri 4:25 p.m. - 5:00 p.m.
Break until 17:00pm (PDT) (Break)
Fri 5:00 p.m. - 5:45 p.m.

Hessian, Fisher, and Covariance matrices are not only used for preconditioning optimizers, but also in generalization metrics, predicting hyperparameters, and Bayesian inference. These matrices contain valuable information that can advance theory in statistical learning, but they are very expensive to compute exactly for modern deep neural networks with billions of parameters. We make use of a highly optimized implementation for computing these matrices with various degrees of approximation to close the gap between theory and practice. We are able to significantly reduce the overhead of computing these matrices through a hybrid data-parallel + model-parallel approach.

Rio Yokota
Fri 5:45 p.m. - 5:55 p.m.
Q&A with Rio Yokota (Discussion)
Rio Yokota
Fri 5:55 p.m. - 6:00 p.m.
Closing remarks (Discussion)