

Session

Statistical Learning Theory 4


Thu 12 July 7:00 - 7:20 PDT

Optimal Distributed Learning with Multi-pass Stochastic Gradient Methods

Junhong Lin · Volkan Cevher

We study the generalization properties of distributed algorithms in the setting of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We investigate distributed stochastic gradient methods (SGM) with mini-batches and multiple passes over the data. We show that optimal generalization error bounds can be retained for distributed SGM provided that the partition level is not too large. Our results improve on the state-of-the-art theory and cover the case where the regression function may not lie in the hypothesis space. In particular, they show that distributed SGM has a smaller theoretical computational complexity than distributed kernel ridge regression (KRR) and classic SGM.
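The recipe the abstract describes (partition the data, run multi-pass mini-batch SGM in the RKHS on each partition, average the local estimators) can be sketched in a few lines. The following is a minimal numpy illustration under simplifying assumptions: a Gaussian kernel, squared loss, and uniform averaging of local predictions. Function names such as local_sgm and distributed_sgm_predict, and all constants, are our own, so this is a toy of the general idea rather than the algorithm analyzed in the paper.

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        # Pairwise Gaussian (RBF) kernel matrix between rows of A and B.
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))

    def local_sgm(X, y, passes=5, batch=8, step=0.5, sigma=1.0, rng=None):
        # Multi-pass mini-batch stochastic gradient method in the RKHS,
        # parameterized by coefficients alpha over the local sample.
        rng = rng or np.random.default_rng(0)
        n = X.shape[0]
        K = gaussian_kernel(X, X, sigma)
        alpha = np.zeros(n)
        for _ in range(passes):
            for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
                resid = K[idx] @ alpha - y[idx]        # f(x_i) - y_i on the mini-batch
                alpha[idx] -= step / len(idx) * resid  # functional gradient step
        return alpha

    def distributed_sgm_predict(X, y, X_test, m=4, sigma=1.0):
        # Split the data over m "machines", run local SGM, average the predictions.
        preds = []
        for Xj, yj in zip(np.array_split(X, m), np.array_split(y, m)):
            alpha = local_sgm(Xj, yj, sigma=sigma)
            preds.append(gaussian_kernel(X_test, Xj, sigma) @ alpha)
        return np.mean(preds, axis=0)

    # Toy usage: noisy sine regression.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (400, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(400)
    X_test = np.linspace(-3, 3, 50)[:, None]
    print(distributed_sgm_predict(X, y, X_test)[:5])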

Thu 12 July 7:20 - 7:40 PDT

Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates

Dong Yin · Yudong Chen · Kannan Ramchandran · Peter Bartlett

In this paper, we develop distributed optimization algorithms that are provably robust against Byzantine failures (arbitrary and potentially adversarial behavior in distributed computing systems), with a focus on achieving optimal statistical performance. A main result of this work is a sharp analysis of two robust distributed gradient descent algorithms, based on median and trimmed mean operations, respectively. We prove statistical error rates for three kinds of population loss functions: strongly convex, non-strongly convex, and smooth non-convex. In particular, these algorithms are shown to achieve order-optimal statistical error rates for strongly convex losses. To achieve better communication efficiency, we further propose a median-based distributed algorithm that is provably robust and uses only one communication round. For strongly convex quadratic loss, we show that this algorithm achieves the same optimal error rate as the robust distributed gradient descent algorithms.
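To make the aggregation rules concrete, here is a minimal numpy sketch of distributed gradient descent with coordinate-wise median and trimmed-mean aggregation of worker gradients. The worker setup, the constants, and names like robust_gd and make_worker are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def coordinate_median(grads):
        # Coordinate-wise median of the workers' gradient vectors (rows).
        return np.median(grads, axis=0)

    def coordinate_trimmed_mean(grads, beta=0.1):
        # Coordinate-wise beta-trimmed mean: drop the largest and smallest
        # beta-fraction of values in each coordinate, then average the rest.
        m = grads.shape[0]
        k = int(np.floor(beta * m))
        s = np.sort(grads, axis=0)
        return s[k:m - k].mean(axis=0)

    def robust_gd(workers_grad, w0, steps=100, lr=0.1, agg=coordinate_median):
        # Distributed gradient descent: each worker reports a gradient at w
        # (possibly corrupted); the server aggregates robustly and updates w.
        w = w0.copy()
        for _ in range(steps):
            grads = np.stack([g(w) for g in workers_grad])
            w -= lr * agg(grads)
        return w

    # Toy usage: least squares, with two Byzantine workers sending garbage.
    rng = np.random.default_rng(0)
    d, m, n = 5, 10, 50
    w_star = rng.standard_normal(d)

    def make_worker(byzantine=False):
        X = rng.standard_normal((n, d))
        y = X @ w_star + 0.1 * rng.standard_normal(n)
        if byzantine:
            return lambda w: rng.standard_normal(d) * 100.0   # arbitrary gradients
        return lambda w: X.T @ (X @ w - y) / n                # honest local gradient

    workers = [make_worker(byzantine=(i < 2)) for i in range(m)]
    w_med = robust_gd(workers, np.zeros(d), agg=coordinate_median)
    w_trim = robust_gd(workers, np.zeros(d),
                       agg=lambda g: coordinate_trimmed_mean(g, beta=0.2))
    print(np.linalg.norm(w_med - w_star), np.linalg.norm(w_trim - w_star))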

Thu 12 July 7:40 - 7:50 PDT

Functional Gradient Boosting based on Residual Network Perception

Atsushi Nitanda · Taiji Suzuki

Residual Networks (ResNets) have become state-of-the-art models in deep learning, and several theoretical studies have been devoted to understanding why ResNets work so well. One attractive viewpoint is that ResNet is optimizing the risk in a functional space by combining an ensemble of effective features. In this paper, we adopt this viewpoint to construct a new gradient boosting method, a class of methods known to be very powerful in data analysis. To do so, we formalize the boosting perspective of ResNet mathematically using the notion of functional gradients and propose a new method called ResFGB for classification tasks by leveraging ResNet perception. Two types of generalization guarantees are provided from the optimization perspective: one is a margin bound and the other is an expected risk bound obtained via a sample-splitting technique. Experimental results show superior performance of the proposed method over state-of-the-art methods such as LightGBM.
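A rough way to see the functional-gradient view is the following toy sketch: fit a linear classifier on the current features, compute the per-sample gradient of the logistic risk with respect to the feature vectors, fit a simple ridge-regression "block" to the negative of that gradient, and update the features residually. All names (resfgb_sketch and so on), the linear weak learner, and the constants are our own simplifications; the actual ResFGB method uses richer blocks and comes with the guarantees stated above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

    def fit_linear_logistic(Phi, y, steps=200, lr=0.5):
        # Plain gradient descent for a linear logistic classifier on features Phi.
        n, d = Phi.shape
        w, b = np.zeros(d), 0.0
        for _ in range(steps):
            p = sigmoid(Phi @ w + b)
            g = p - y                       # dloss/dscore for labels in {0, 1}
            w -= lr * Phi.T @ g / n
            b -= lr * g.mean()
        return w, b

    def resfgb_sketch(X, y, blocks=5, eta=0.3, ridge=1e-2):
        # Grow a ResNet-like feature map Phi by functional gradient boosting:
        # each "block" is a ridge-regression map fit to the negative functional
        # gradient of the logistic risk with respect to the current features.
        Phi = X.copy()
        layers = []
        for _ in range(blocks):
            w, b = fit_linear_logistic(Phi, y)
            grad = (sigmoid(Phi @ w + b) - y)[:, None] * w[None, :]   # dloss/dPhi_i
            A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
            T = np.linalg.solve(A, Phi.T @ (-grad))   # weak learner: Phi -> -grad
            layers.append(T)
            Phi = Phi + eta * Phi @ T                 # residual-style feature update
        w, b = fit_linear_logistic(Phi, y)
        return layers, w, b, eta

    def resfgb_predict(X, layers, w, b, eta):
        Phi = X.copy()
        for T in layers:
            Phi = Phi + eta * Phi @ T
        return (sigmoid(Phi @ w + b) > 0.5).astype(int)

    # Toy usage: two Gaussian blobs.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.standard_normal((100, 2)) + 2, rng.standard_normal((100, 2)) - 2])
    y = np.array([1] * 100 + [0] * 100)
    layers, w, b, eta = resfgb_sketch(X, y)
    print((resfgb_predict(X, layers, w, b, eta) == y).mean())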

Thu 12 July 7:50 - 8:00 PDT

Binary Classification with Karmic, Threshold-Quasi-Concave Metrics

Bowei Yan · Sanmi Koyejo · Kai Zhong · Pradeep Ravikumar

Complex performance measures, beyond the popular measure of accuracy, are increasingly being used in the context of binary classification. These performance measures are typically not even decomposable; that is, the loss evaluated on a batch of samples cannot in general be expressed as a sum or average of losses evaluated at individual samples, which in turn requires new theoretical and methodological developments beyond standard treatments of supervised learning. In this paper, we advance the understanding of binary classification for complex performance measures by identifying two key properties: a so-called Karmic property, and a more technical threshold-quasi-concavity property, which we show is milder than existing structural assumptions imposed on performance measures. Under these properties, we show that the Bayes optimal classifier is a threshold function of the conditional probability of the positive class. We then leverage this result to construct a computationally practical plug-in classifier, via a novel threshold estimator, and further provide a novel statistical analysis of the classification error with respect to complex performance measures.
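The plug-in idea (estimate the conditional class probability, then threshold it) can be illustrated with a small sketch for the F1 score. Below, the threshold is chosen by a grid search on held-out data, which merely stands in for the paper's threshold estimator, and p_model is a placeholder for any fitted probability estimate; both are assumptions for illustration.

    import numpy as np

    def f1_at_threshold(p_hat, y, delta):
        # F1 of the classifier 1{p_hat(x) >= delta} on a held-out sample.
        pred = (p_hat >= delta).astype(int)
        tp = np.sum((pred == 1) & (y == 1))
        fp = np.sum((pred == 1) & (y == 0))
        fn = np.sum((pred == 0) & (y == 1))
        return 2 * tp / max(2 * tp + fp + fn, 1)

    def plug_in_classifier(p_model, X_val, y_val, metric=f1_at_threshold, grid=200):
        # Plug-in rule: estimate eta(x) = P(Y=1 | x) with any probabilistic model,
        # then pick the threshold maximizing the target metric on validation data.
        p_hat = p_model(X_val)
        thresholds = np.linspace(0.0, 1.0, grid)
        scores = [metric(p_hat, y_val, t) for t in thresholds]
        delta_star = thresholds[int(np.argmax(scores))]
        return lambda X: (p_model(X) >= delta_star).astype(int), delta_star

    # Toy usage with a hand-rolled logistic model standing in for the estimate of eta.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 3))
    w_true = np.array([1.5, -2.0, 0.5])
    y = (rng.uniform(size=1000) < 1 / (1 + np.exp(-X @ w_true))).astype(int)

    def p_model(X):                       # placeholder for a fitted estimate of eta(x)
        return 1 / (1 + np.exp(-X @ w_true))

    clf, delta_star = plug_in_classifier(p_model, X[:500], y[:500])
    print(delta_star, f1_at_threshold(p_model(X[500:]), y[500:], delta_star))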