Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers
Abstract
In modern machine learning, parallelizing training is an important strategy for scaling up. Asynchronous stochastic gradient descent (ASGD) maximally utilizes available hardware by avoiding waits for slow workers. However, with constant step sizes, the convergence of ASGD is nonetheless negatively affected by slow workers, since their updates arrive with large delays. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping ``stabilizes'' training. In this work, we provide a theoretical justification for this behavior: we show that clipping removes the dependence on the maximum delay from the oracle complexity. We employ a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to more heavy-tailed distributions, motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
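For concreteness, the two objects referenced in the abstract admit the following standard formulations (the notation here is illustrative and may differ from that used in the body of the paper): the clipping operator with threshold $\lambda > 0$ applied to a stochastic gradient $g$, and the sub-Weibull$(\theta)$ tail condition on the gradient noise $\xi$,
\[
  \mathrm{clip}_{\lambda}(g) \;=\; \min\!\Big(1,\, \frac{\lambda}{\|g\|}\Big)\, g,
  \qquad
  \Pr\big(\|\xi\| \ge t\big) \;\le\; 2\exp\!\big(-(t/K)^{1/\theta}\big) \quad \forall\, t \ge 0,
\]
where $K > 0$ is a scale parameter; $\theta = 1/2$ recovers sub-Gaussian tails, $\theta = 1$ recovers sub-exponential tails, and larger $\theta$ allows heavier tails.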