Proximal and Federated Random Reshuffling

Konstantin Mishchenko · Ahmed Khaled · Peter Richtarik

Hall E #722

Keywords: [ OPT: Convex ] [ OPT: Non-Convex ] [ OPT: Stochastic ] [ OPT: Large Scale, Parallel and Distributed ]

Thu 21 Jul 3 p.m. PDT — 5 p.m. PDT
Spotlight presentation: Applications/Optimization
Thu 21 Jul 12:30 p.m. PDT — 2 p.m. PDT

Abstract: Random Reshuffling (RR), also known as Stochastic Gradient Descent (SGD) without replacement, is a popular and theoretically grounded method for finite-sum minimization. We propose two new algorithms: Proximal and Federated Random Reshuffling (ProxRR and FedRR). The first algorithm, ProxRR, solves composite finite-sum minimization problems in which the objective is the sum of a (potentially non-smooth) convex regularizer and an average of $n$ smooth objectives. ProxRR evaluates the proximal operator only once per epoch. When the proximal operator is expensive to compute, this small difference makes ProxRR up to $n$ times faster than algorithms that evaluate the proximal operator in every iteration, such as proximal (stochastic) gradient descent. We give examples of practical optimization tasks where the proximal operator is difficult to compute and ProxRR has a clear advantage. One such task is federated or distributed optimization, where the evaluation of the proximal operator corresponds to communication across the network. We obtain our second algorithm, FedRR, as a special case of ProxRR applied to federated optimization, and prove that it has a smaller communication footprint than either distributed gradient descent or Local SGD. Our theory covers both constant and decreasing stepsizes in both convex and nonconvex regimes, and allows for importance resampling schemes that can improve conditioning, which may be of independent interest. Finally, we corroborate our results with experiments on real data sets.
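To make the "one proximal step per epoch" idea concrete, here is a minimal sketch in Python. It is an illustration, not the authors' reference implementation. The lasso-type test problem, the names `soft_threshold`, `objective` and `prox_rr`, and the scaling of the proximal step by `stepsize * n` (chosen because n gradient steps are taken between proximal steps) are all assumptions not stated in the abstract.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (example regularizer, assumed here)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def objective(A, b, lam, x):
    """Composite objective: average of n least-squares terms plus lam * ||x||_1."""
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))


def prox_rr(A, b, lam, stepsize, epochs, seed=0):
    """ProxRR-style loop (sketch): reshuffled gradient pass, then one prox per epoch."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    prox_calls = 0
    for _ in range(epochs):
        # Fresh permutation each epoch: every component is visited exactly once.
        for i in rng.permutation(n):
            grad_i = (A[i] @ x - b[i]) * A[i]  # gradient of 0.5 * (a_i @ x - b_i)**2
            x = x - stepsize * grad_i
        # Single proximal step per epoch. The stepsize * n scaling is an assumption
        # made for this sketch.
        x = soft_threshold(x, stepsize * n * lam)
        prox_calls += 1
    return x, prox_calls


if __name__ == "__main__":
    # Small synthetic problem (hypothetical data, for illustration only).
    rng = np.random.default_rng(1)
    A = rng.standard_normal((50, 5))
    b = A @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.01 * rng.standard_normal(50)
    x, prox_calls = prox_rr(A, b, lam=0.1, stepsize=0.01, epochs=50)
    print("objective:", objective(A, b, 0.1, x), "prox calls:", prox_calls)
```

In this sketch the proximal operator is called `epochs` times in total, whereas a method that applies it after every stochastic gradient step would call it `n * epochs` times. In the federated reading described in the abstract, each proximal step would correspond to a round of communication, so the same counting applies to communication rounds.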
