Session: Optimization (Convex) 3
Shampoo: Preconditioned Stochastic Tensor Optimization
Vineet Gupta · Tomer Koren · Yoram Singer
Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Surprisingly, although it involves a more complex update rule, Shampoo's runtime per step is comparable in practice to that of simple gradient methods such as SGD, AdaGrad, and Adam.
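As a rough illustration of the update described above, here is a minimal NumPy sketch of a Shampoo-style step for a matrix-shaped parameter: a left and a right preconditioner are accumulated from the gradient's row and column statistics, and the gradient is contracted with the inverse fourth root of each. The learning rate, the epsilon values, and the toy least-squares problem are illustrative choices, not taken from the paper.

```python
import numpy as np

def matrix_power(mat, p, eps=1e-12):
    """Power of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    return (v * np.maximum(w, eps) ** p) @ v.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo-style update for a matrix parameter W with gradient G."""
    L = L + G @ G.T                              # left (row) statistics, m x m
    R = R + G.T @ G                              # right (column) statistics, n x n
    W = W - lr * matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W, L, R

# Toy usage on a least-squares objective (sizes and constants are illustrative).
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(64, 10)), rng.normal(size=(64, 5))
W = np.zeros((10, 5))
L, R = 1e-4 * np.eye(10), 1e-4 * np.eye(5)       # preconditioners start at eps * I
for _ in range(100):
    G = X.T @ (X @ W - Y) / len(X)               # gradient of 0.5 * ||XW - Y||^2 / n
    W, L, R = shampoo_step(W, G, L, R)
```

For a parameter with more than two dimensions, the same idea applies with one preconditioner per tensor mode; the matrix case above is just the easiest to write down.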
Characterizing Implicit Bias in Terms of Optimization Geometry
Suriya Gunasekar · Jason Lee · Daniel Soudry · Nati Srebro
We study the bias of generic optimization methods, including Mirror Descent, Natural Gradient Descent, and Steepest Descent with respect to different potentials and norms, when optimizing underdetermined linear models or separable linear classification problems. We ask whether the global minimum (among the many possible global minima) reached by optimization can be characterized in terms of the potential or norm, and independently of hyperparameter choices such as step size and momentum.
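The simplest instance of the question posed above is plain gradient descent, i.e. steepest descent in the Euclidean norm, on an underdetermined least-squares problem: started from zero, it converges to the minimum ℓ2-norm interpolating solution. The short sketch below, with arbitrary toy dimensions, only demonstrates that phenomenon; it is not code from the paper.

```python
import numpy as np

# Underdetermined least squares: more parameters (d) than equations (n),
# so infinitely many w satisfy Xw = y exactly.
rng = np.random.default_rng(0)
n, d = 20, 100
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

# Plain gradient descent on 0.5 * ||Xw - y||^2, started from zero.
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, 2) ** 2          # step size below 1/L for stability
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y            # minimum Euclidean-norm interpolator

print(np.linalg.norm(X @ w - y))              # ~0: gradient descent interpolates
print(np.linalg.norm(w - w_min_norm))         # ~0: and it picked the min-norm solution
```

Swapping the geometry (e.g. running mirror descent with a different potential) changes which interpolating solution is reached, which is exactly the kind of characterization the paper studies.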
A Distributed Second-Order Algorithm You Can Trust
Celestine Mendler-Dünner · Aurelien Lucchi · Matilde Gargiani · Yatao Bian · Thomas Hofmann · Martin Jaggi
Due to the rapid growth of data and computational resources, distributed optimization has become an active research area in recent years. While first-order methods seem to dominate the field, second-order methods are nevertheless attractive as they potentially require fewer communication rounds to converge. However, there are significant drawbacks that impede their wide adoption, such as the computation and the communication of a large Hessian matrix. In this paper we present a new algorithm for distributed training of generalized linear models that only requires the computation of diagonal blocks of the Hessian matrix on the individual workers. To deal with this approximate information we propose an adaptive approach that - akin to trust-region methods - dynamically adapts the auxiliary model to compensate for modeling errors. We provide theoretical rates of convergence for a wide class of problems including $L_1$-regularized objectives. We also demonstrate that our approach achieves state-of-the-art results on multiple large benchmark datasets.
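To make the block-Hessian idea concrete, the sketch below runs a generic block-diagonal Newton iteration for logistic regression: each simulated "worker" owns a slice of the features and forms only the Hessian block for its own slice, while a crude damping parameter is tightened or loosened depending on whether a step delivers the decrease the local model predicted. This is a loose, trust-region-flavoured caricature under those assumptions, not the algorithm from the paper.

```python
import numpy as np

def loss_grad(w, X, y):
    """Logistic loss and gradient; labels y in {-1, +1}."""
    m = y * (X @ w)
    p = 1.0 / (1.0 + np.exp(m))                  # = sigmoid(-m)
    return np.mean(np.logaddexp(0.0, -m)), -X.T @ (y * p) / len(y)

rng = np.random.default_rng(0)
n, d, K = 200, 40, 4                             # K simulated workers, d/K features each
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))
blocks = np.array_split(np.arange(d), K)

w, sigma = np.zeros(d), 1.0                      # sigma acts as the trust/damping control
for _ in range(30):
    f, g = loss_grad(w, X, y)
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))
    D = p * (1.0 - p)                            # diagonal Hessian weights
    step = np.zeros(d)
    for idx in blocks:                           # each worker: only its own Hessian block
        Hkk = X[:, idx].T @ (D[:, None] * X[:, idx]) / n
        step[idx] = -np.linalg.solve(Hkk + sigma * np.eye(len(idx)), g[idx])
    f_new, _ = loss_grad(w + step, X, y)
    if f - f_new >= 0.1 * (-(g @ step)):         # got enough of the predicted decrease
        w, sigma = w + step, max(0.5 * sigma, 1e-3)   # accept step, loosen damping
    else:
        sigma *= 2.0                             # reject step, tighten damping
```

In a real distributed run, only the per-block solves would happen on the workers and only the low-dimensional quantities needed for the acceptance test would be communicated, which is the point of avoiding the full Hessian.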
A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning
Konstantin Mishchenko · Franck Iutzeler · Jérôme Malick · Massih-Reza Amini
Distributed learning aims at computing high-quality models by training over scattered data. This covers a diversity of scenarios, including computer clusters and mobile agents. One of the main challenges is then to deal with heterogeneous machines and unreliable communications. In this setting, we propose and analyze a flexible asynchronous optimization algorithm for solving nonsmooth learning problems. Unlike most existing methods, our algorithm is adjustable to various levels of communication cost, machine computational power, and data distribution evenness. We prove that the algorithm converges linearly with a fixed learning rate that depends neither on communication delays nor on the number of machines. Although long communication delays may slow down performance, no delay can break convergence.
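Below is a toy simulation of this asynchronous setting, not the paper's algorithm: workers hold shards of an l1-regularized least-squares problem, only one randomly chosen worker reports per round, and the master mixes each worker's most recent, possibly stale gradient into a proximal-gradient step with a fixed step size. All problem sizes and constants are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, d, M = 300, 50, 5                          # M workers, each holding one data shard
X = rng.normal(size=(n, d))
y = X @ (rng.normal(size=d) * (rng.random(d) < 0.2)) + 0.01 * rng.normal(size=n)
shards = np.array_split(np.arange(n), M)

lam = 0.05                                    # l1 penalty weight
lr = 0.5 * n / np.linalg.norm(X, 2) ** 2      # conservative fixed step size

w = np.zeros(d)
stale = [w.copy() for _ in range(M)]          # each worker's possibly old copy of w
grads = [np.zeros(d) for _ in range(M)]       # each worker's last reported gradient
for _ in range(3000):
    k = rng.integers(M)                       # only one worker responds this round
    idx = shards[k]
    grads[k] = X[idx].T @ (X[idx] @ stale[k] - y[idx]) / len(idx)
    stale[k] = w.copy()                       # that worker picks up the fresh iterate
    g = sum(grads) / M                        # mix of fresh and delayed gradients
    w = soft_threshold(w - lr * g, lr * lam)  # proximal step for the l1 term
```

The fixed step size used here is just a cautious choice for the simulation; the paper's analysis is what justifies delay-independent step sizes in the actual algorithm.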
Gradient Coding from Cyclic MDS Codes and Expander Graphs
Netanel Raviv · Rashish Tandon · Alexandros Dimakis · Itzhak Tamo
Gradient coding is a technique for straggler mitigation in distributed learning. In this paper we design novel gradient codes using tools from classical coding theory, namely cyclic MDS codes, which compare favourably with existing solutions, both in the applicable range of parameters and in the complexity of the involved algorithms. In addition, we introduce an approximate variant of the gradient coding problem, in which we settle for approximate gradient computation instead of exact computation. This approach enables graceful degradation: the $\ell_2$ error of the approximate gradient is a decreasing function of the number of stragglers. Our main result is that the normalized adjacency matrix of an expander graph yields excellent approximate gradient codes, and that this approach allows us to perform significantly less computation than exact gradient coding. We experimentally test our approach on Amazon EC2 and show that the generalization error of approximate gradient coding is very close to that of the full gradient, while requiring significantly less computation from the workers.
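Here is a minimal sketch of the approximate-gradient-coding idea, with a simple cyclic assignment standing in for the expander-graph construction and random vectors standing in for partial gradients: each worker returns the average of its assigned data blocks' gradients, the master simply averages whatever the non-stragglers send back, and the $\ell_2$ error to the true average gradient shrinks as the number of stragglers decreases. This is an illustration of the general mechanism, not the codes constructed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, deg, s = 40, 10, 8                        # n workers/data blocks, assignment degree, s stragglers

# Cyclic assignment: worker i holds data blocks i, i+1, ..., i+deg-1 (mod n).
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + np.arange(deg)) % n] = 1.0 / deg   # row-normalized assignment matrix

g = rng.normal(size=(n, 5))                  # one toy partial gradient per data block
true_avg = g.mean(axis=0)

codewords = A @ g                            # each worker sends the average of its blocks
alive = rng.permutation(n)[: n - s]          # indices of the non-straggling workers
approx_avg = codewords[alive].mean(axis=0)   # master just averages what arrived

print(np.linalg.norm(approx_avg - true_avg)) # small, and shrinks as s decreases
```

Replacing the cyclic assignment matrix with the normalized adjacency matrix of a good expander is what gives the strong error guarantees the abstract refers to.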