Session

Parallel and Distributed Learning 1

Wed 11 July 2:00 - 2:20 PDT

Optimal Tuning for Divide-and-conquer Kernel Ridge Regression with Massive Data

Ganggang Xu · Zuofeng Shang · Guang Cheng

Divide-and-conquer is a powerful approach for large and massive data analysis. In the nonparametric regression setting, although various theoretical frameworks have been established to achieve optimality in estimation or hypothesis testing, how to choose the tuning parameter in a practically effective way is still an open problem. In this paper, we propose a data-driven procedure based on divide-and-conquer for selecting the tuning parameters in kernel ridge regression by modifying the popular Generalized Cross-Validation (GCV, Wahba, 1990). While the proposed criterion is computationally scalable for massive data sets, it is also shown under mild conditions to be asymptotically optimal in the sense that minimizing the proposed distributed-GCV (dGCV) criterion is equivalent to minimizing the true global conditional empirical loss of the averaged function estimator, extending the existing optimality results of GCV to the divide-and-conquer framework.
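The abstract does not spell out the dGCV formula, but the classical GCV ingredients it builds on are easy to illustrate. The sketch below is a minimal Python illustration only: the Gaussian kernel, the function names, and the pooled-ratio aggregation are all assumptions for exposition, not the paper's exact dGCV criterion. It computes per-partition residual sums of squares and smoother-matrix traces for kernel ridge regression and combines them into a GCV-type score that a central node could minimize over a grid of tuning parameters.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def local_gcv_parts(X, y, lam, bandwidth=1.0):
    """Per-partition ingredients of a GCV-type criterion for kernel ridge
    regression: residual sum of squares and trace of the smoother matrix."""
    n = len(y)
    K = gaussian_kernel(X, X, bandwidth)
    A = K @ np.linalg.inv(K + n * lam * np.eye(n))   # smoother ("hat") matrix
    resid = y - A @ y
    return resid @ resid, np.trace(A), n

def dgcv_like_score(partitions, lam):
    """Illustrative dGCV-style score: pool residual sums of squares and
    smoother traces across partitions before forming the GCV ratio.
    The paper's actual dGCV criterion may aggregate these quantities
    differently; this only shows the general shape of such a score."""
    rss, tr, n = 0.0, 0.0, 0
    for X, y in partitions:
        r, t, m = local_gcv_parts(X, y, lam)
        rss += r
        tr += t
        n += m
    return (rss / n) / (1.0 - tr / n) ** 2
```

In a divide-and-conquer fit one would evaluate such a score over a grid of candidate tuning parameters, pick the minimizer, and then average the per-partition kernel ridge estimators trained with that value, mirroring how GCV is used in the non-distributed case.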

Wed 11 July 2:20 - 2:30 PDT

Distributed Nonparametric Regression under Communication Constraints

Yuancheng Zhu · John Lafferty

This paper studies the problem of nonparametric estimation of a smooth function with data distributed across multiple machines. We assume an independent sample from a white noise model is collected at each machine, and an estimator of the underlying true function needs to be constructed at a central machine. We place limits on the number of bits that each machine can use to transmit information to the central machine. Our results give both asymptotic lower bounds and matching upper bounds on the statistical risk under various settings. We identify three regimes, depending on the relationship among the number of machines, the size of data available at each machine, and the communication budget. When the communication budget is small, the statistical risk depends solely on this communication bottleneck, regardless of the sample size. In the regime where the communication budget is large, the classic minimax risk in the non-distributed estimation setting is recovered. In an intermediate regime, the statistical risk depends on both the sample size and the communication budget.
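To make the bit-budget constraint concrete, here is a minimal sketch, assuming a toy protocol in which each machine transmits a single B-bit quantized summary and the central machine averages the messages. The function names and the one-number-per-machine scheme are illustrative assumptions; the estimators and rate analysis in the paper are considerably more refined.

```python
import numpy as np

def quantize(value, n_bits, lo=-1.0, hi=1.0):
    """Uniform scalar quantizer: map a value in [lo, hi] to one of 2**n_bits
    levels. Returns the integer code (the bits actually transmitted) and the
    dequantized value the receiver reconstructs."""
    levels = 2 ** n_bits
    clipped = np.clip(value, lo, hi)
    code = int(np.round((clipped - lo) / (hi - lo) * (levels - 1)))
    return code, lo + code * (hi - lo) / (levels - 1)

def distributed_estimate(machine_samples, n_bits_per_machine):
    """Toy protocol: each machine sends a B-bit quantized version of its local
    sample mean; the central machine averages the dequantized messages.
    This only illustrates the communication bottleneck studied in the paper."""
    messages = []
    for y in machine_samples:
        _, dequant = quantize(np.asarray(y).mean(), n_bits_per_machine)
        messages.append(dequant)
    return float(np.mean(messages))
```

With a very small bit budget the quantization error dominates no matter how many samples each machine holds, which is the intuition behind the "communication-limited" regime identified in the paper.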

Wed 11 July 2:30 - 2:40 PDT

Coded Sparse Matrix Multiplication

Sinong Wang · Jiashang Liu · Ness Shroff

In a large-scale and distributed matrix multiplication problem $C=A^{\intercal}B$, where $C\in\mathbb{R}^{r\times t}$, coded computation plays an important role in effectively dealing with ``stragglers'' (distributed computations that may get delayed due to a few slow or faulty processors). However, existing coded schemes can destroy the significant sparsity that exists in large-scale machine learning problems and can result in much higher computation overhead, i.e., $O(rt)$ decoding time. In this paper, we develop a new coded computation strategy, which we call \emph{sparse code}, that achieves a near-\emph{optimal recovery threshold}, \emph{low computation overhead}, and \emph{linear decoding time} $O(nnz(C))$. We implement our scheme and demonstrate the advantage of the approach over both uncoded schemes and the current fastest coded strategies.
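For orientation, the sketch below shows the generic dense random-coding scheme that the paper improves upon, not the authors' sparse code itself: each worker multiplies a random linear combination of the column blocks of $A$ by $B$, and the master recovers $C$ from any set of finished workers equal in number to the blocks. A dense generator matrix like this tolerates stragglers but incurs exactly the dense $O(rt)$-style decoding cost the paper avoids; the proposed sparse code instead draws sparse combinations and decodes with a peeling-style procedure in $O(nnz(C))$ time. All names here are illustrative assumptions.

```python
import numpy as np

def encode_blocks(A_blocks, n_workers, seed=0):
    """Give each worker a random linear combination of the column blocks of A.
    Worker i will later compute coded_i^T @ B = sum_j G[i, j] * (A_j^T @ B)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n_workers, len(A_blocks)))  # dense generator matrix
    coded = [sum(g * blk for g, blk in zip(row, A_blocks)) for row in G]
    return G, coded

def decode(G, worker_results, finished):
    """Recover the block products C_j = A_j^T B from exactly len(A_blocks)
    finished workers by inverting the corresponding rows of G.
    This dense solve is what makes decoding expensive for large C."""
    rows = G[finished]                                    # square if k == #blocks
    stacked = np.stack([worker_results[i] for i in finished])
    flat = np.linalg.solve(rows, stacked.reshape(len(finished), -1))
    return flat.reshape(stacked.shape)                    # blocks of C, in order
```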

Wed 11 July 2:40 - 2:50 PDT

Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication

Zebang Shen · Aryan Mokhtari · Tengfei Zhou · Peilin Zhao · Hui Qian

Recently, the decentralized optimization problem has been attracting growing attention. Most existing methods are deterministic, with high per-iteration cost and a convergence rate that depends quadratically on the problem condition number. Moreover, dense communication is necessary to ensure convergence even if the dataset is sparse. In this paper, we generalize the decentralized optimization problem to a monotone operator root-finding problem and propose a stochastic algorithm named DSBA that (1) converges geometrically, with a rate depending only linearly on the problem condition number, and (2) can be implemented using sparse communication only. Additionally, DSBA handles important learning problems like AUC maximization, which cannot be tackled efficiently in the previous problem setting. Experiments on convex minimization and AUC maximization validate the efficiency of our method.
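As a point of reference, here is a minimal sketch of the plain decentralized stochastic gradient template that methods in this area build on: each node mixes its iterate with its neighbors' via a doubly stochastic matrix (one dense communication round) and then takes a local stochastic gradient step on a least-squares loss. This is not DSBA; the paper replaces this template with a variance-reduced operator-splitting update that needs only sparse communication. The function name and the least-squares objective are assumptions for illustration.

```python
import numpy as np

def decentralized_sgd_step(X_local, y_local, w_nodes, W, lr, rng):
    """One round of plain decentralized SGD on a least-squares objective.
    X_local[i], y_local[i]: data held by node i.
    w_nodes: (n_nodes, d) array of current iterates, one row per node.
    W: (n_nodes, n_nodes) doubly stochastic mixing matrix of the network."""
    mixed = W @ w_nodes                      # gossip averaging with neighbors
    new = np.empty_like(w_nodes)
    for i, (X, y) in enumerate(zip(X_local, y_local)):
        j = rng.integers(len(y))             # sample one local data point
        grad = (X[j] @ mixed[i] - y[j]) * X[j]
        new[i] = mixed[i] - lr * grad        # local stochastic gradient step
    return new
```

The dense mixing step `W @ w_nodes` is exactly the per-iteration communication the paper seeks to sparsify, and the plain stochastic gradient is what its variance-reduced update accelerates.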

Wed 11 July 2:50 - 3:00 PDT

Faster Derivative-Free Stochastic Algorithm for Shared Memory Machines

Bin Gu · Zhouyuan Huo · Cheng Deng · Heng Huang

Asynchronous parallel stochastic gradient optimization plays a pivotal role in solving large-scale machine learning problems in big data applications. Zeroth-order (derivative-free) methods estimate the gradient using only two function evaluations, and have therefore been applied to problems where explicit gradient calculations are computationally expensive or infeasible. Recently, the first asynchronous parallel stochastic zeroth-order algorithm (AsySZO) was proposed. However, its convergence rate is $O(1/\sqrt{T})$ for smooth, possibly non-convex learning problems, which is significantly slower than $O(1/T)$, the best convergence rate of (asynchronous) stochastic gradient algorithms. To fill this gap, in this paper we first point out the fundamental reason for the slow convergence rate of AsySZO, and then propose a new asynchronous stochastic zeroth-order algorithm (AsySZO+). Through rigorous theoretical analysis, we provide a faster convergence rate of $O(1/(bT))$ ($b$ is the mini-batch size) for AsySZO+, a significant improvement over $O(1/\sqrt{T})$. Experimental results on the application of ensemble learning confirm that our AsySZO+ has a faster convergence rate than existing (asynchronous) stochastic zeroth-order algorithms.
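The two-function-evaluation gradient estimate mentioned in the abstract is the standard building block of such methods, and a mini-batched version of it is easy to sketch. The code below is an illustrative sketch only: the interface `f(x, i)` for a per-sample loss and the function name are assumptions, and AsySZO+ itself additionally involves the asynchronous-parallel update scheme analyzed in the paper.

```python
import numpy as np

def zo_minibatch_gradient(f, x, batch_idx, mu=1e-4, rng=None):
    """Two-point zeroth-order gradient estimate averaged over a mini-batch.
    For each sampled index i, perturb x along a random Gaussian direction u
    and use (f(x + mu*u, i) - f(x, i)) / mu * u as a surrogate for the
    directional derivative; averaging over the batch reduces the variance,
    which is the mechanism behind the mini-batch factor b in the rate."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    g = np.zeros(d)
    for i in batch_idx:
        u = rng.standard_normal(d)
        g += (f(x + mu * u, i) - f(x, i)) / mu * u
    return g / len(batch_idx)
```

A worker in an asynchronous scheme would compute such an estimate on a sampled mini-batch and apply it to the shared iterate without waiting for the other workers, which is the setting the paper's analysis addresses.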