Timezone: »

Oral
PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization
Zhize Li · Hongyan Bao · Xiangliang Zhang · Peter Richtarik

Tue Jul 20 06:00 AM -- 06:20 AM (PDT) @ None
In this paper, we propose a novel stochastic gradient estimator---ProbAbilistic Gradient Estimator (PAGE)---for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p_t$ or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1-p_t$. We give a simple formula for the optimal choice of $p_t$. Moreover, we prove the first tight lower bound $\Omega(n+\frac{\sqrt{n}}{\epsilon^2})$ for nonconvex finite-sum problems, which also leads to a tight lower bound $\Omega(b+\frac{\sqrt{b}}{\epsilon^2})$ for nonconvex online problems, where $b:= \min\{\frac{\sigma^2}{\epsilon^2}, n\}$. Then, we show that PAGE obtains the optimal convergence results $O(n+\frac{\sqrt{n}}{\epsilon^2})$ (finite-sum) and $O(b+\frac{\sqrt{b}}{\epsilon^2})$ (online) matching our lower bounds for both nonconvex finite-sum and online problems. Besides, we also show that for nonconvex functions satisfying the Polyak-\L ojasiewicz (PL) condition, PAGE can automatically switch to a faster linear convergence rate $O(\cdot\log \frac{1}{\epsilon})$. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch showing that PAGE not only converges much faster than SGD in training but also achieves the higher test accuracy, validating the optimal theoretical results and confirming the practical superiority of PAGE.

#### Author Information

##### Peter Richtarik (KAUST)

Peter Richtarik is an Associate Professor of Computer Science and Mathematics at KAUST and an Associate Professor of Mathematics at the University of Edinburgh. He is an EPSRC Fellow in Mathematical Sciences, Fellow of the Alan Turing Institute, and is affiliated with the Visual Computing Center and the Extreme Computing Research Center at KAUST. Dr. Richtarik received his PhD from Cornell University in 2007, and then worked as a Postdoctoral Fellow in Louvain, Belgium, before joining Edinburgh in 2009, and KAUST in 2017. Dr. Richtarik's research interests lie at the intersection of mathematics, computer science, machine learning, optimization, numerical linear algebra, high performance computing and applied probability. Through his recent work on randomized decomposition algorithms (such as randomized coordinate descent methods, stochastic gradient descent methods and their numerous extensions, improvements and variants), he has contributed to the foundations of the emerging field of big data optimization, randomized numerical linear algebra, and stochastic methods for empirical risk minimization. Several of his papers attracted international awards, including the SIAM SIGEST Best Paper Award, the IMA Leslie Fox Prize (2nd prize, twice), and the INFORMS Computing Society Best Student Paper Award (sole runner up). He is the founder and organizer of the Optimization and Big Data workshop series.​