Session: Deep Learning Optimization
An Investigation into Neural Net Optimization via Hessian Eigenvalue Density
Behrooz Ghorbani · Shankar Krishnan · Ying Xiao
To understand the dynamics of training in deep neural networks, we study the evolution of the Hessian eigenvalue density throughout the optimization process. In non-batch-normalized networks, we observe the rapid appearance of large isolated eigenvalues in the spectrum, along with a surprising concentration of the gradient in the corresponding eigenspaces. In a batch-normalized network, these two effects are almost absent. We give a theoretical rationale to partially explain these phenomena. As part of this work, we adapt advanced tools from numerical linear algebra that allow scalable and accurate estimation of the entire Hessian spectrum of ImageNet-scale neural networks; this technique may be of independent interest in other applications.
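For intuition, here is a minimal sketch of the kind of numerical-linear-algebra tool the last sentence refers to: estimating a spectral density from matrix-vector products via Lanczos quadrature. The explicit matrix below stands in for a Hessian-vector-product oracle (in a real network the matvec would come from double backpropagation), and the names `hvp` and `spectral_density`, the step counts, and the toy matrix are all illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of spectral-density estimation via Lanczos quadrature.
# The explicit symmetric matrix plays the role of the Hessian; `hvp` is a
# stand-in for a Hessian-vector-product oracle.
import numpy as np

def lanczos(hvp, dim, num_steps, rng):
    """Run `num_steps` of Lanczos with a random start vector.
    Returns the tridiagonal coefficients (alphas, betas)."""
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    v_prev = np.zeros(dim)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(num_steps):
        w = hvp(v) - beta * v_prev
        alpha = v @ w
        w -= alpha * v
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        v_prev, v = v, w / (beta + 1e-12)
    return np.array(alphas), np.array(betas[:-1])

def spectral_density(hvp, dim, num_steps=90, seed=0):
    """Approximate eigenvalue locations and weights (Gauss quadrature nodes)."""
    rng = np.random.default_rng(seed)
    alphas, betas = lanczos(hvp, dim, num_steps, rng)
    T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    nodes, vecs = np.linalg.eigh(T)
    weights = vecs[0, :] ** 2      # first components give the quadrature weights
    return nodes, weights

# Toy example: a random symmetric matrix standing in for the Hessian.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 500))
H = (A + A.T) / 2
nodes, weights = spectral_density(lambda v: H @ v, dim=500)
```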
Differentiable Linearized ADMM
Xingyu Xie · Jianlong Wu · Guangcan Liu · Zhisheng Zhong · Zhouchen Lin
Recently, a great many learning-based optimization methods that combine data-driven architectures with classical optimization algorithms have been proposed and explored, showing superior empirical performance in solving various ill-posed inverse problems. However, rigorous analysis of the convergence behavior of learning-based optimization remains scarce. In particular, most existing theories are specific to unconstrained problems and do not apply to the more general case where some variables of interest are subject to constraints. In this paper, we propose Differentiable Linearized ADMM (D-LADMM) for solving problems with linear constraints. Specifically, D-LADMM is a K-layer LADMM-inspired deep neural network, obtained by first introducing learnable weights into the classical Linearized ADMM algorithm and then generalizing the proximal operator to a learnable activation function. Notably, we mathematically prove that there exists a set of learnable parameters for which D-LADMM generates globally converged solutions, and we show that those desired parameters can be attained by training D-LADMM in a proper way. To the best of our knowledge, we are the first to provide a convergence analysis for learning-based optimization methods on constrained problems. Experiments on simulated and real applications verify the superiority of D-LADMM over LADMM.
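As a rough illustration of the unrolling idea (not the paper's exact formulation), the sketch below writes one LADMM-style layer for the toy problem min ||x||_1 s.t. Ax = b, in which the transpose step and the proximal operator are replaced by per-layer quantities that could be trained end-to-end with any autodiff framework. The parameter names `W_k` and `theta_k`, the initialization, and the soft-thresholding activation are assumptions made for the sketch.

```python
# A minimal numpy sketch of one unrolled, LADMM-style layer in the spirit of
# D-LADMM: the "learnable" pieces are plain arrays here; in practice they would
# be trained end-to-end. Problem: min ||x||_1 subject to Ax = b.
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1; placeholder for a learnable activation."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ladmm_layer(x, lam, A, b, W_k, theta_k, mu=1.0):
    """One unrolled layer: W_k plays the role of A^T / L, theta_k the threshold."""
    residual = A @ x - b
    x_next = soft_threshold(x - W_k @ (lam + mu * residual), theta_k)
    lam_next = lam + mu * (A @ x_next - b)
    return x_next, lam_next

# Forward pass through K layers with (untrained) per-layer parameters,
# initialized near the classical Linearized ADMM choice A^T / ||A||_2^2.
rng = np.random.default_rng(0)
m, n, K = 20, 50, 10
A = rng.standard_normal((m, n)) / np.sqrt(m)
b = A @ (rng.standard_normal(n) * (rng.random(n) < 0.1))   # sparse ground truth
x, lam = np.zeros(n), np.zeros(m)
params = [(A.T / np.linalg.norm(A, 2) ** 2, 0.05) for _ in range(K)]
for W_k, theta_k in params:
    x, lam = ladmm_layer(x, lam, A, b, W_k, theta_k)
```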
Adaptive Stochastic Natural Gradient Method for One-Shot Neural Architecture Search
Youhei Akimoto · Shinichi Shirakawa · Nozomu Yoshinari · Kento Uchida · Shota Saito · Kouhei Nishida
The high sensitivity of neural architecture search (NAS) methods to their inputs, such as the step size (i.e., learning rate) and the search space, prevents practitioners from applying them out of the box to their own problems, even though their purpose is to automate part of the tuning process. Aiming at fast, robust, and widely applicable NAS, we develop a generic optimization framework for NAS. We turn the coupled optimization of connection weights and neural architecture into a differentiable optimization by means of stochastic relaxation. The framework accepts arbitrary search spaces (widely applicable) and enables gradient-based simultaneous optimization of weights and architecture (fast). We propose a stochastic natural gradient method with an adaptive step-size mechanism built upon our theoretical investigation (robust). Despite its simplicity and the absence of problem-dependent parameter tuning, our method exhibits near state-of-the-art performance at low computational budgets on both image classification and inpainting tasks.
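A hedged sketch of the stochastic-relaxation idea follows: the architecture is drawn from a categorical distribution whose parameters are updated with a Monte Carlo natural-gradient estimate. The ranking-based utilities, the fixed step size `eps`, and the `loss_of` oracle (a stand-in for evaluating a sampled architecture with the current weights) are simplifications; the paper's adaptive step-size mechanism is omitted.

```python
# A minimal sketch of stochastic relaxation with natural-gradient updates on a
# categorical architecture distribution. For an exponential family in
# expectation parameters, the natural gradient of the expected loss is
# estimated from sampled one-hot architectures weighted by ranking utilities.
import numpy as np

def natural_gradient_step(theta, loss_of, rng, samples=2, eps=0.1):
    """theta: (num_decisions, num_choices) categorical probabilities."""
    d, c = theta.shape
    onehots, losses = [], []
    for _ in range(samples):
        choice = np.array([rng.choice(c, p=theta[i]) for i in range(d)])
        onehots.append(np.eye(c)[choice])
        losses.append(loss_of(choice))
    # Ranking-based utilities: best sample gets +1, worst gets -1.
    order = np.argsort(losses)
    utility = np.zeros(samples)
    utility[order[0]], utility[order[-1]] = 1.0, -1.0
    grad = sum(u * (h - theta) for u, h in zip(utility, onehots)) / samples
    theta = np.clip(theta + eps * grad, 1e-6, None)
    return theta / theta.sum(axis=1, keepdims=True)

# Toy usage: the "loss" prefers choice 0 in every decision.
rng = np.random.default_rng(0)
theta = np.full((4, 3), 1.0 / 3.0)
for _ in range(200):
    theta = natural_gradient_step(theta, lambda ch: float(ch.sum()), rng)
```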
A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent
YongQiang Cai · Qianxiao Li · Zuowei Shen
Despite its empirical success and recent theoretical progress, a quantitative analysis of the effect of batch normalization (BN) on the convergence and stability of gradient descent is generally lacking. In this paper, we provide such an analysis on the simple problem of ordinary least squares (OLS). Since the precise dynamical properties of gradient descent (GD) are completely known for the OLS problem, this allows us to isolate and compare the additional effects of BN. More precisely, we show that unlike GD, gradient descent with BN (BNGD) converges for arbitrary learning rates for the weights, and the convergence remains linear under mild conditions. Moreover, we quantify two different sources of acceleration of BNGD over GD -- one due to over-parameterization, which improves the effective condition number, and another due to having a large range of learning rates that give rise to fast descent. These phenomena set BNGD apart from GD and could account for much of its robustness properties. These findings are confirmed quantitatively by numerical experiments, which further show that many of the uncovered properties of BNGD in OLS are also observed qualitatively in more complex supervised learning problems.
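To make the setting concrete, here is a small sketch of gradient descent on OLS with a batch-normalized parameterization, in which predictions are rescaled by the standard deviation of the linear output over the data. The parameterization, the gradients derived for it, and the step sizes are illustrative assumptions, not the paper's exact formulation; the large weight step size is only meant to echo the robustness claim above.

```python
# A minimal sketch of BN-style gradient descent on ordinary least squares:
# model y ~ a * (Xw) / sqrt(w^T H w) with H = X^T X / n.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d)) * np.linspace(1.0, 10.0, d)   # ill-conditioned design
w_true = rng.standard_normal(d)
y = X @ w_true
H = X.T @ X / n

def bngd_step(a, w, lr_a, lr_w):
    z = X @ w
    sigma = np.sqrt(w @ H @ w)
    r = a * z / sigma - y
    grad_a = np.mean(r * z) / sigma
    grad_w = a / n * (X.T @ r / sigma - (r @ z) * (H @ w) / sigma ** 3)
    return a - lr_a * grad_a, w - lr_w * grad_w

a, w = 1.0, rng.standard_normal(d)
for _ in range(2000):
    a, w = bngd_step(a, w, lr_a=0.1, lr_w=10.0)   # deliberately large weight step size
print(np.mean((a * (X @ w) / np.sqrt(w @ H @ w) - y) ** 2))   # final BN-parameterized loss
```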
The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith
We investigate how the behavior of stochastic gradient descent is influenced by model size. By studying families of models obtained by increasing the number of channels in a base network, we examine how the optimal hyperparameters---the batch size and learning rate at which the test error is minimized---correlate with the network width. We find that the optimal "normalized noise scale," which we define to be a function of the batch size, learning rate and the initialization conditions, is proportional to the number of channels (in the absence of batch normalization). This conclusion holds for MLPs, ConvNets and ResNets. A surprising consequence is that if we wish to maintain optimal performance as the network width increases, we must use increasingly small batch sizes. Based on our experiments, we also conjecture that there may be a critical width, beyond which the optimal performance of networks trained with constant SGD ceases to improve unless additional regularization is introduced.
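The paper's normalized noise scale is not reproduced here; as a rough arithmetic illustration only, the conventional SGD noise scale g ≈ εN/B (learning rate times training-set size over batch size) shows how a target noise scale pins down the batch size for a given learning rate and dataset size. The function name and constants below are purely illustrative.

```python
# Illustrative arithmetic only: the conventional SGD noise scale g = eps * N / B,
# not necessarily the paper's "normalized" variant. Solving for B shows how the
# batch size must change to hold a target noise scale fixed.
def batch_size_for_noise_scale(target_g, learning_rate, dataset_size):
    return learning_rate * dataset_size / target_g

# E.g. on a 50k-example dataset at a fixed learning rate, doubling the target
# noise scale halves the implied batch size.
for g in (1.0, 2.0, 4.0):
    print(g, batch_size_for_noise_scale(g, learning_rate=0.1, dataset_size=50_000))
```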
AdaGrad stepsizes: sharp convergence over nonconvex landscapes
Rachel Ward · Xiaoxia Wu · Leon Bottou
Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune parameters such as the stepsize schedule. Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization. We bridge this gap by providing strong theoretical guarantees for the convergence of AdaGrad over smooth, nonconvex landscapes. We show that AdaGrad converges to a stationary point at the optimal $O(1/\sqrt{N})$ rate (up to a $\log(N)$ factor), and at the optimal $O(1/N)$ rate in the non-stochastic setting. In particular, both our theoretical and numerical results imply that AdaGrad is robust to the \emph{unknown Lipschitz constant and level of stochastic noise on the gradient}, in a near-optimal sense.
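For reference, here is a minimal sketch of the scalar ("norm") variant of AdaGrad that this kind of analysis considers: a single stepsize driven by accumulated squared gradient norms, with no knowledge of the Lipschitz constant or noise level required. The toy gradient oracle, `eta`, and `b0` are illustrative choices.

```python
# A minimal sketch of AdaGrad-Norm: one adaptive scalar stepsize for all
# coordinates, driven by the running sum of squared gradient norms.
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1e-2, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    b_sq = b0 ** 2
    for _ in range(steps):
        g = grad(x, rng)
        b_sq += g @ g                  # accumulate squared gradient norms
        x -= eta / np.sqrt(b_sq) * g   # stepsize adapts without knowing L or the noise level
    return x

# Toy usage: noisy gradients of a smooth quadratic.
grad = lambda x, rng: 4.0 * x + 0.1 * rng.standard_normal(x.shape)
x_final = adagrad_norm(grad, x0=np.ones(10))
```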
Beyond Backprop: Online Alternating Minimization with Auxiliary Variables
Anna Choromanska · Benjamin Cowen · Sadhana Kumaravel · Ronny Luss · Mattia Rigotti · Irina Rish · Paolo DiAchille · Viatcheslav Gurev · Brian Kingsbury · Ravi Tejwani · Djallel Bouneffouf
Despite significant recent advances in deep neural networks, training them remains a challenge due to the highly non-convex nature of the objective function. State-of-the-art methods rely on error backpropagation, which suffers from several well-known issues, such as vanishing and exploding gradients, the inability to handle non-differentiable nonlinearities or to parallelize weight updates across layers, and biological implausibility. These limitations continue to motivate exploration of alternative training algorithms, including several recently proposed auxiliary-variable methods, which break the complex nested objective into local subproblems, thereby avoiding gradient chains (and thus the vanishing-gradient issue) and allowing weight updates to be parallelized, among other advantages. However, those techniques are mainly offline (batch), which limits their applicability to extremely large datasets or unlimited data streams in online, continual or reinforcement learning. The main contribution of our work is a novel online (stochastic/mini-batch) alternating minimization (AM) algorithm for training deep neural networks, together with the first theoretical convergence guarantees for AM in stochastic settings and an extensive empirical evaluation on various architectures and datasets, demonstrating advantages of the proposed approach over both offline auxiliary-variable methods and backpropagation-based stochastic gradient descent.
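A heavily simplified sketch of the auxiliary-variable idea (not the paper's algorithm): per mini-batch, an auxiliary code decouples a two-layer network, and the code and each weight block are updated in turn against local objectives, so no gradient chain spans the layers. The quadratic penalties, the penalty weight `rho`, and the step sizes are assumptions made for the sketch.

```python
# A minimal sketch of online alternating minimization with an auxiliary code
# for a two-layer network; columns of x and y are examples in a mini-batch.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def am_batch_update(W1, W2, x, y, rho=1.0, code_steps=5, lr=0.05):
    """One online AM step on mini-batch (x, y)."""
    a = W1 @ x                                           # initialize the auxiliary code
    for _ in range(code_steps):                          # (1) update the code, weights fixed
        grad_a = rho * (a - W1 @ x) + (a > 0) * (W2.T @ (W2 @ relu(a) - y))
        a -= lr * grad_a
    # (2) update each weight block against the fixed code: two local objectives,
    # with no gradient chain spanning the layers.
    m = x.shape[1]
    W1 -= lr * rho * (W1 @ x - a) @ x.T / m
    W2 -= lr * (W2 @ relu(a) - y) @ relu(a).T / m
    return W1, W2

# Toy usage on streaming mini-batches.
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((32, 16))
W2 = 0.1 * rng.standard_normal((4, 32))
for _ in range(200):
    x = rng.standard_normal((16, 64))
    y = np.tanh(2.0 * x[:4])                             # arbitrary target function
    W1, W2 = am_batch_update(W1, W2, x, y)
```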
SWALP: Stochastic Weight Averaging in Low Precision Training
Guandao Yang · Tianyi Zhang · Polina Kirichenko · Junwen Bai · Andrew Wilson · Christopher De Sa
Low-precision operations can provide scalability, memory savings, portability, and energy efficiency. This paper proposes SWALP, an approach to low-precision training that averages low-precision SGD iterates with a modified learning-rate schedule. SWALP is easy to implement and can match the performance of full-precision SGD even with all numbers quantized down to 8 bits, including the gradient accumulators. Additionally, we show that SWALP converges arbitrarily close to the optimal solution for quadratic objectives, and in strongly convex settings to a noise ball asymptotically smaller than that of low-precision SGD.
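A minimal sketch of the idea: low-precision SGD iterates (fixed-point quantization with stochastic rounding) are averaged after a warmup period. The 8-bit format and scale, the full-precision running average, and the least-squares toy problem are simplifications rather than the paper's exact recipe.

```python
# A minimal sketch of averaging low-precision SGD iterates on a least-squares
# toy problem. Note that the per-step update is often smaller than the
# quantization step, which is exactly where stochastic rounding matters.
import numpy as np

def quantize(x, rng, bits=8, scale=4.0):
    """Fixed-point quantization with stochastic rounding to `bits` bits."""
    levels = 2 ** (bits - 1)
    step = scale / levels
    scaled = np.clip(x / step, -levels, levels - 1)
    floor = np.floor(scaled)
    rounded = floor + (rng.random(x.shape) < (scaled - floor))  # stochastic rounding
    return rounded * step

rng = np.random.default_rng(0)
dim, lr, warmup, steps = 50, 0.05, 200, 2000
A = rng.standard_normal((200, dim)) / np.sqrt(200)
b = A @ rng.standard_normal(dim)
w = np.zeros(dim)
w_avg, n_avg = np.zeros(dim), 0
for t in range(steps):
    i = rng.integers(len(b))                 # stochastic gradient from one row
    g = A[i] * (A[i] @ w - b[i])
    w = quantize(w - lr * g, rng)            # keep the iterate in low precision
    if t >= warmup:                          # average after a warmup period
        w_avg = (n_avg * w_avg + w) / (n_avg + 1)
        n_avg += 1
```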
Efficient optimization of loops and limits with randomized telescoping sums
Alex Beatson · Ryan P Adams
We consider optimization problems in which the objective requires an inner loop with many steps or is the limit of a sequence of increasingly costly approximations. Meta-learning, training recurrent neural networks, and optimization of the solutions to differential equations are all examples of optimization problems with this character. In such problems, it can be expensive to compute the objective function value and its gradient, but truncating the loop or using less accurate approximations can induce biases that damage the overall solution. We propose \emph{randomized telescope} (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates. We identify conditions under which RT estimators achieve optimization convergence rates independent of the length of the loop or the required accuracy of the approximation. We also derive a method for tuning RT estimators online to maximize a lower bound on the expected decrease in loss per unit of computation. We evaluate our adaptive RT estimators on a range of applications including meta-optimization of learning rates, variational inference of ODE parameters, and training an LSTM to model long sequences.
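A small sketch of a randomized-telescope (Russian-roulette style) estimator for the limit of a sequence of approximations follows; the paper applies the same construction to gradients and tunes the truncation distribution online, whereas here a fixed geometric distribution and a toy scalar sequence are used purely for illustration.

```python
# A minimal sketch of a randomized-telescope estimator: write the limit as a
# telescoping sum and reweight each sampled term by its survival probability,
# so cheap truncated evaluations are unbiased in expectation.
import numpy as np

def rt_estimate(L, rng, p_continue=0.7, max_terms=60):
    """Estimate lim_n L(n) via E[ sum_{n<=N} (L(n) - L(n-1)) / P(N >= n) ]."""
    est, survive_prob, prev = 0.0, 1.0, 0.0
    for n in range(1, max_terms + 1):
        current = L(n)
        est += (current - prev) / survive_prob   # reweight by P(N >= n) = p^(n-1)
        prev = current
        if rng.random() > p_continue:            # truncate with probability 1 - p_continue
            break
        survive_prob *= p_continue
    return est

# Toy usage: L(n) = 1 - 0.5**n converges geometrically to 1; the estimator's
# mean matches the limit even though most samples truncate after a few terms.
rng = np.random.default_rng(0)
L = lambda n: 1.0 - 0.5 ** n
samples = [rt_estimate(L, rng) for _ in range(10000)]
print(np.mean(samples))   # close to 1.0
```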
Self-similar Epochs: Value in arrangement
Eliav Buchnik · Edith Cohen · Avinatan Hasidim · Yossi Matias
Optimization of a machine learning model is typically carried out by performing stochastic gradient updates on epochs that consist of randomly ordered training examples. This practice means that each fraction of an epoch comprises an independent random sample of the training data that may not preserve informative structure present in the full data. We hypothesize that training can be more effective with more principled, ``self-similar'' arrangements, which allow each epoch to provide some of the benefits of multiple ones.
Our case study is matrix factorization, commonly used to learn metric embeddings of entities such as videos or words from example associations. We construct arrangements that preserve the weighted Jaccard similarities of rows and columns and experimentally observe that our arrangements yield training acceleration of 3\%-30\% on synthetic and recommendation datasets. Principled arrangements of training examples emerge as a novel and potentially powerful performance knob for SGD that merits further exploration.
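As a loose illustration only, and not the paper's weighted-Jaccard-preserving construction, the sketch below arranges matrix-factorization training triples by a MinHash bucket of each row's column set, so that rows with high Jaccard similarity tend to appear together within an epoch; the arranged order would then feed a standard SGD loop. The bucketing scheme and all names are assumptions made for the sketch.

```python
# A rough sketch of a similarity-aware ("self-similar" in spirit) arrangement of
# (row, col, value) training triples for SGD matrix factorization.
import numpy as np
from collections import defaultdict

def minhash_bucket(columns, seed=0):
    """Single-hash MinHash of a row's set of observed columns; for an idealized
    random hash, two rows collide with probability ~ their Jaccard similarity."""
    return min(hash((seed, int(c))) & 0xFFFFFFFF for c in columns)

def self_similar_order(triples, rng):
    """Group triples by the MinHash bucket of their row, shuffle within buckets,
    and concatenate the buckets into one arranged epoch."""
    row_cols = defaultdict(set)
    for i, j, _ in triples:
        row_cols[i].add(j)
    buckets = defaultdict(list)
    for t in triples:
        buckets[minhash_bucket(row_cols[t[0]])].append(t)
    order = []
    for key in sorted(buckets):
        block = buckets[key]
        order.extend(block[i] for i in rng.permutation(len(block)))
    return order

# Toy usage: the arranged epoch would replace a uniformly shuffled one.
rng = np.random.default_rng(0)
triples = [(int(rng.integers(100)), int(rng.integers(200)), float(rng.random()))
           for _ in range(5000)]
epoch = self_similar_order(triples, rng)
```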