ICML Poster Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization

Poster

Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization

Pan Zhou · Xiao-Tong Yuan

Keywords: [ Convex Optimization ] [ Large Scale Learning and Big Data ] [ Optimization - Convex ]

[ Abstract ]

Abstract: Stochastic variance-reduced gradient (SVRG) algorithms have been shown to work favorably in solving large-scale learning problems. Despite the remarkable success, the stochastic gradient complexity of SVRG-type algorithms usually scales linearly with data size and thus could still be expensive for huge data. To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems that enjoys provably improved data-size-independent complexity guarantees. More precisely, for quadratic loss

$F(\wm)$ of

$n$ components, we prove that HSDMPG can attain an

$\epsilon$ -optimization-error

$E[F(\theta)-F(\theta^*)] \leq \epsilon$ within

$\mathcal{O}\Big(\!\frac{\kappa^{1.5}}{\epsilon^{0.25}}\! \log^{\!1.5}\!\!\big(\frac{1}{\epsilon}\big) \wedge \Big(\!\kappa \sqrt{n} \log^2\!\!\big(\frac{1}{\epsilon}\big) \!+\! \frac{\kappa^3}{n^{1.5}\epsilon} \!\Big)\!\Big)$ stochastic gradient evaluations, where

$\kappa$ is condition number. For generic strongly convex loss functions, we prove a nearly identical complexity bound though at the cost of slightly increased logarithmic factors. For large-scale learning problems, our complexity bounds are superior to those of the prior state-of-the-art SVRG algorithms with or without dependence on data size. Particularly, in the case of

$\epsilon\!=\!\mathcal{O}\big(1/\sqrt{n}\big)$ which is at the order of intrinsic excess error bound of a learning model and thus sufficient for generalization, the stochastic gradient complexity bounds of HSDMPG~for quadratic and generic loss functions are respectively

$\mathcal{O} (n^{0.875}\log^{1.5}(n))$ and

$\mathcal{O} (n^{0.875}\log^{2.25}(n))$ , which to our best knowledge, for the first time achieve optimal generalization in less than a single pass over data. Extensive numerical results demonstrate the computational advantages of our algorithm over the prior ones.

Chat is not available.