### Poster

## Random Shuffling Beats SGD after Finite Epochs

### Jeff HaoChen · Suvrit Sra

##### Pacific Ballroom #206

Keywords: [ Large Scale Learning and Big Data ] [ Convex Optimization ]


Abstract:
A long-standing problem in stochastic optimization is proving that RandomShuffle, the without-replacement version of SGD, converges faster than the usual with-replacement SGD. Building upon (Gürbüzbalaban et al., 2015), we present the *first* (to our knowledge) non-asymptotic results for this problem by proving that after a reasonable number of epochs RandomShuffle converges faster than SGD. Specifically, we prove that for strongly convex, second-order smooth functions, the iterates of RandomShuffle converge to the optimal solution at the rate $\mathcal{O}(1/T^2 + n^3/T^3)$, where $n$ is the number of components in the objective and $T$ is the number of iterations. This result implies that after $\mathcal{O}(\sqrt{n})$ epochs, RandomShuffle is *strictly better* than SGD (which converges as $\mathcal{O}(1/T)$). The key step toward showing this better dependence on $T$ is the introduction of $n$ into the bound; as our analysis shows, a dependence on $n$ is in general unavoidable without further assumptions. To understand how RandomShuffle behaves in practice, we further explore two empirically useful settings: data sparsity and over-parameterization. For sparse data, RandomShuffle attains the rate $\mathcal{O}(1/T^2)$, again strictly better than SGD. Under a setting closely related to over-parameterization, RandomShuffle is shown to converge faster than SGD after *any* number of iterations. Finally, we extend the analysis of RandomShuffle to smooth non-convex and convex functions.
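The two sampling schemes compared in the abstract can be sketched in a few lines. The snippet below is an illustrative experiment, not the paper's method: it builds a strongly convex least-squares objective with $n$ components in a noiseless (interpolation-style) setting, then runs with-replacement SGD and without-replacement RandomShuffle side by side; the only difference between the two runs is whether each epoch samples indices independently or walks a fresh random permutation. All names and constants (`make_problem`, the step size, the problem sizes) are ours, chosen for the demo.

```python
import numpy as np

def make_problem(n=50, d=10, seed=0):
    # Strongly convex finite sum: f(x) = (1/n) * sum_i (a_i^T x - b_i)^2.
    # b is generated from a ground-truth x_star, so the minimizer is known.
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    x_star = rng.standard_normal(d)
    b = A @ x_star
    return A, b, x_star

def run(A, b, epochs, lr, shuffle, seed=1):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        if shuffle:
            idx = rng.permutation(n)     # without replacement: RandomShuffle
        else:
            idx = rng.integers(0, n, n)  # with replacement: vanilla SGD
        for i in idx:
            grad = 2.0 * (A[i] @ x - b[i]) * A[i]  # gradient of i-th component
            x = x - lr * grad
    return x

A, b, x_star = make_problem()
x_rs = run(A, b, epochs=200, lr=0.01, shuffle=True)
x_sgd = run(A, b, epochs=200, lr=0.01, shuffle=False)
err_rs = np.linalg.norm(x_rs - x_star)
err_sgd = np.linalg.norm(x_sgd - x_star)
print(f"RandomShuffle error: {err_rs:.3e}, SGD error: {err_sgd:.3e}")
```

Because the generated data is noiseless, this toy problem falls in the interpolation regime the abstract relates to over-parameterization, so both methods drive the error toward zero; the paper's rate separation concerns the general strongly convex case, where the residual sampling noise is what slows with-replacement SGD.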
