Session: Parallel and Distributed Learning 1
Optimal Tuning for Divide-and-conquer Kernel Ridge Regression with Massive Data
Ganggang Xu · Zuofeng Shang · Guang Cheng
Divide-and-conquer is a powerful approach for large and massive data analysis. In the nonparametric regression setting, although various theoretical frameworks have been established to achieve optimality in estimation or hypothesis testing, how to choose the tuning parameter in a practically effective way is still an open problem. In this paper, we propose a data-driven procedure based on divide-and-conquer for selecting the tuning parameters in kernel ridge regression by modifying the popular Generalized Cross-validation (GCV, Wahba, 1990). While the proposed criterion is computationally scalable for massive data sets, it is also shown, under mild conditions, to be asymptotically optimal in the sense that minimizing the proposed distributed-GCV (dGCV) criterion is equivalent to minimizing the true global conditional empirical loss of the averaged function estimator, extending the existing optimality results of GCV to the divide-and-conquer framework.
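To make the idea concrete, here is a minimal sketch of how block-level kernel ridge regression fits can be aggregated into a single GCV-style score over a grid of tuning parameters. The Gaussian kernel, the particular way residuals and hat-matrix traces are pooled, and all function names are illustrative assumptions, not the authors' exact dGCV criterion.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix between rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def local_gcv_parts(X, y, lam, bandwidth=1.0):
    # Fit kernel ridge regression on one data block and return the pieces a
    # GCV-type criterion needs: residual sum of squares and the trace of the
    # local smoother ("hat") matrix.
    n = len(y)
    K = gaussian_kernel(X, X, bandwidth)
    A = K @ np.linalg.inv(K + n * lam * np.eye(n))   # local hat matrix
    resid = y - A @ y
    return resid @ resid, np.trace(A), n

def dgcv_score(blocks, lam, bandwidth=1.0):
    # Aggregate block-level quantities into one GCV-style score
    # (illustrative pooling; not the paper's exact formula).
    rss, tr, n = 0.0, 0.0, 0
    for Xb, yb in blocks:
        r, t, nb = local_gcv_parts(Xb, yb, lam, bandwidth)
        rss += r; tr += t; n += nb
    return (rss / n) / (1.0 - tr / n) ** 2

# Toy usage: split the data into blocks and pick lambda on a grid.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(600, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(600)
blocks = [(X[i::3], y[i::3]) for i in range(3)]
lams = 10.0 ** np.arange(-6, 0)
best_lam = min(lams, key=lambda l: dgcv_score(blocks, l))
```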
Distributed Nonparametric Regression under Communication Constraints
Yuancheng Zhu · John Lafferty
This paper studies the problem of nonparametric estimation of a smooth function with data distributed across multiple machines. We assume an independent sample from a white noise model is collected at each machine, and an estimator of the underlying true function needs to be constructed at a central machine. We place limits on the number of bits that each machine can use to transmit information to the central machine. Our results give both asymptotic lower bounds and matching upper bounds on the statistical risk under various settings. We identify three regimes, depending on the relationship among the number of machines, the size of data available at each machine, and the communication budget. When the communication budget is small, the statistical risk depends solely on this communication bottleneck, regardless of the sample size. In the regime where the communication budget is large, the classic minimax risk in the non-distributed estimation setting is recovered. In an intermediate regime, the statistical risk depends on both the sample size and the communication budget.
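As a rough illustration of the setting (not the paper's rate-optimal protocol), the sketch below has each machine send a fixed number of uniformly quantized basis coefficients to the central machine, which simply averages them. The cosine basis, the scalar quantizer, and the parameters k and bits are all assumptions made for this example.

```python
import numpy as np

def local_coefficients(y, x, k):
    # Empirical cosine-basis coefficients computed from one machine's sample.
    basis = np.array([np.sqrt(2) * np.cos(np.pi * j * x) for j in range(1, k + 1)])
    return basis @ y / len(y)

def quantize(v, bits, lo=-2.0, hi=2.0):
    # Uniform scalar quantizer: each coefficient is encoded with `bits` bits.
    levels = 2 ** bits
    v = np.clip(v, lo, hi)
    idx = np.round((v - lo) / (hi - lo) * (levels - 1))
    return lo + idx * (hi - lo) / (levels - 1)

def distributed_estimate(machines, k=10, bits=4):
    # Each machine transmits k quantized coefficients (k * bits bits total);
    # the central machine averages them and reconstructs the function.
    sent = [quantize(local_coefficients(y, x, k), bits) for x, y in machines]
    coef = np.mean(sent, axis=0)
    grid = np.linspace(0, 1, 200)
    basis = np.array([np.sqrt(2) * np.cos(np.pi * j * grid) for j in range(1, k + 1)])
    return grid, coef @ basis

# Toy usage: m machines, each with n noisy observations of f(x) = sin(2*pi*x).
rng = np.random.default_rng(1)
machines = []
for _ in range(8):
    x = rng.uniform(0, 1, 500)
    machines.append((x, np.sin(2 * np.pi * x) + 0.5 * rng.standard_normal(500)))
grid, fhat = distributed_estimate(machines, k=10, bits=4)
```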
Coded Sparse Matrix Multiplication
Sinong Wang · Jiashang Liu · Ness Shroff
In a large-scale and distributed matrix multiplication problem $C=A^{\intercal}B$, where $C\in\mathbb{R}^{r\times t}$, coded computation plays an important role in effectively dealing with ``stragglers'' (distributed computations that may get delayed due to a few slow or faulty processors). However, existing coded schemes could destroy the significant sparsity that exists in large-scale machine learning problems, and could result in much higher computation overhead, i.e., $O(rt)$ decoding time. In this paper, we develop a new coded computation strategy, which we call \emph{sparse code}, that achieves a near \emph{optimal recovery threshold}, \emph{low computation overhead}, and \emph{linear decoding time} $O(nnz(C))$. We implement our scheme and demonstrate the advantage of the approach over both uncoded strategies and the current fastest coded strategies.
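For context, the sketch below shows a classic dense MDS-style coded scheme for $C=A^{\intercal}B$, in which any $k$ of $n$ workers suffice to decode. It is not the authors' sparse code; its dense encoding and decoding are precisely the kind of overhead the paper seeks to avoid, and all names and parameters here are illustrative.

```python
import numpy as np

def encode_blocks(A_blocks, n_workers):
    # MDS-style encoding: worker i gets a Vandermonde combination of the
    # k column blocks of A, so any k finished workers suffice to decode.
    k = len(A_blocks)
    G = np.vander(np.arange(1, n_workers + 1), k, increasing=True).astype(float)
    return G, [sum(G[i, j] * A_blocks[j] for j in range(k)) for i in range(n_workers)]

def decode(G, finished, partials, k):
    # Invert the k x k sub-generator formed by the workers that responded.
    Gs = G[finished][:, :k]
    coeffs = np.linalg.inv(Gs)
    return [sum(coeffs[j, i] * partials[i] for i in range(k)) for j in range(k)]

# Toy usage: C = A^T B with 3 data blocks, 5 workers, 2 stragglers ignored.
rng = np.random.default_rng(2)
A = rng.standard_normal((100, 30)); B = rng.standard_normal((100, 20))
A_blocks = np.split(A, 3, axis=1)                      # column blocks of A
G, coded = encode_blocks(A_blocks, n_workers=5)
partials = {i: coded[i].T @ B for i in (0, 2, 4)}      # only 3 workers finish
C_blocks = decode(G, list(partials), list(partials.values()), k=3)
C = np.vstack(C_blocks)
assert np.allclose(C, A.T @ B)
```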
Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication
Zebang Shen · Aryan Mokhtari · Tengfei Zhou · Peilin Zhao · Hui Qian
Recently, the decentralized optimization problem has been attracting growing attention. Most existing methods are deterministic with high per-iteration cost and have a convergence rate that depends quadratically on the problem condition number. Moreover, dense communication is necessary to ensure convergence even when the dataset is sparse. In this paper, we generalize the decentralized optimization problem to a monotone operator root finding problem, and propose a stochastic algorithm named DSBA that (1) converges geometrically with a rate that depends only linearly on the problem condition number, and (2) can be implemented using sparse communication only. Additionally, DSBA handles important learning problems like AUC maximization, which cannot be tackled efficiently in the previous problem setting. Experiments on convex minimization and AUC maximization validate the efficiency of our method.
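As a point of reference for the problem setting (not DSBA itself), the following is a minimal gossip-based decentralized stochastic gradient sketch on local least-squares losses; the mixing matrix, step size, and loss are illustrative assumptions.

```python
import numpy as np

def decentralized_sgd(local_data, W, steps=500, lr=0.1):
    # Generic decentralized SGD sketch: each node takes a stochastic gradient
    # step on its local least-squares loss, then averages its iterate with its
    # neighbors through the gossip matrix W. (Plain gossip-based SGD, not the
    # DSBA operator-splitting method.)
    m = len(local_data)
    d = local_data[0][0].shape[1]
    x = np.zeros((m, d))
    rng = np.random.default_rng(0)
    for _ in range(steps):
        grads = np.zeros_like(x)
        for i, (Ai, bi) in enumerate(local_data):
            j = rng.integers(len(bi))                    # sample one local example
            grads[i] = (Ai[j] @ x[i] - bi[j]) * Ai[j]    # stochastic gradient
        x = W @ x - lr * grads                           # gossip step + local step
    return x

# Toy usage: 4 nodes on a ring, sharing one ground-truth model.
rng = np.random.default_rng(3)
w_true = rng.standard_normal(5)
local_data = []
for _ in range(4):
    A = rng.standard_normal((50, 5))
    local_data.append((A, A @ w_true + 0.01 * rng.standard_normal(50)))
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])   # doubly stochastic ring mixing matrix
x = decentralized_sgd(local_data, W)
```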
Faster Derivative-Free Stochastic Algorithm for Shared Memory Machines
Bin Gu · Zhouyuan Huo · Cheng Deng · Heng Huang
Asynchronous parallel stochastic gradient optimization has played a pivotal role in solving large-scale machine learning problems in big data applications. Zeroth-order (derivative-free) methods estimate the gradient using only two function evaluations, and have thus been applied to problems where explicit gradient calculations are computationally expensive or infeasible. Recently, the first asynchronous parallel stochastic zeroth-order algorithm (AsySZO) was proposed. However, its convergence rate is $O(1/\sqrt{T})$ for smooth, possibly non-convex learning problems, which is significantly slower than $O(1/T)$, the best convergence rate of (asynchronous) stochastic gradient algorithms. To fill this gap, in this paper, we first point out the fundamental reason for the slow convergence rate of AsySZO, and then propose a new asynchronous stochastic zeroth-order algorithm (AsySZO+). Through rigorous theoretical analysis, we provide a faster convergence rate $O(1/(bT))$ for AsySZO+, where $b$ is the mini-batch size, which is a significant improvement over $O(1/\sqrt{T})$. Experimental results on an ensemble learning application confirm that AsySZO+ converges faster than existing (asynchronous) stochastic zeroth-order algorithms.
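The two-function-evaluation idea can be sketched as follows: a two-point zeroth-order gradient estimator plugged into mini-batched SGD, where a larger mini-batch size $b$ reduces the estimator's variance. This is a generic serial sketch, not AsySZO+ (no asynchronous shared-memory updates), and the loss, step size, and smoothing parameter are assumptions.

```python
import numpy as np

def zo_gradient(f, x, batch_idx, mu=1e-4, rng=None):
    # Two-point (derivative-free) gradient estimator on a mini-batch:
    # perturb x along a random unit direction u and use two function values.
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)
    return len(x) * (f(x + mu * u, batch_idx) - f(x, batch_idx)) / mu * u

def zo_sgd(f, x0, n_samples, batch_size=16, steps=300, lr=0.05, seed=0):
    # Mini-batched zeroth-order SGD; larger batches reduce the variance of the
    # estimator, which is the effect behind O(1/(bT))-type rates.
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        batch = rng.integers(n_samples, size=batch_size)
        x -= lr * zo_gradient(f, x, batch, rng=rng)
    return x

# Toy usage: a least-squares loss accessed only through function values.
rng = np.random.default_rng(4)
A = rng.standard_normal((200, 10)); w_true = rng.standard_normal(10)
b = A @ w_true + 0.01 * rng.standard_normal(200)
loss = lambda w, idx: 0.5 * np.mean((A[idx] @ w - b[idx]) ** 2)
w_hat = zo_sgd(loss, np.zeros(10), n_samples=200)
```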