

Session

Parallel and Distributed Learning 3


Fri 13 July 7:00 - 7:20 PDT

The Hidden Vulnerability of Distributed Learning in Byzantium

El Mahdi El Mhamdi · Rachid Guerraoui · Sébastien Rouault

While machine learning is going through an era of celebrated success, concerns have been raised about the vulnerability of its backbone: stochastic gradient descent (SGD). Recent approaches have been proposed to ensure the robustness of distributed SGD against adversarial (Byzantine) workers sending poisoned gradients during the training phase. Some of these approaches have been proven Byzantine-resilient: they ensure the convergence of SGD despite the presence of a minority of adversarial workers. We show in this paper that convergence is not enough. In high dimension $d \gg 1$, an adversary can exploit the loss function's non-convexity to make SGD converge to ineffective models. More precisely, we bring to light that existing Byzantine-resilient schemes leave a margin of poisoning of $\Omega(f(d))$, where $f(d)$ increases at least like $\sqrt[p]{d}$. Based on this leeway, we build a simple attack and experimentally show its strong, in some cases total, effectiveness on CIFAR-10 and MNIST. We introduce Bulyan, and prove it significantly reduces the attacker's leeway to a narrow $O(1/\sqrt{d})$ bound. We empirically show that Bulyan does not suffer the fragility of existing aggregation rules and, at a reasonable cost in terms of required batch size, achieves convergence as if only non-Byzantine gradients had been used to update the model.
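For readers who want a concrete picture, here is a minimal NumPy sketch of a Bulyan-style aggregation: a Krum-like score repeatedly selects n - 2f gradients, then a coordinate-wise trimmed average keeps the n - 4f values closest to the per-coordinate median. The function names, the simplified selection score, and the n >= 4f + 3 assertion are illustrative choices, not the paper's exact construction or guarantees.

```python
import numpy as np

def krum_select(grads, f):
    """Index of the gradient with the smallest Krum-style score: the sum of
    squared distances to its n - f - 2 closest neighbours (simplified)."""
    n = len(grads)
    dists = np.array([[np.sum((g - h) ** 2) for h in grads] for g in grads])
    # skip index 0 of the sorted row, which is the zero distance to itself
    scores = [np.sort(dists[i])[1:n - f - 1].sum() for i in range(n)]
    return int(np.argmin(scores))

def bulyan_style_aggregate(grads, f):
    """Iteratively pick n - 2f gradients with the selection rule above, then
    average, per coordinate, the n - 4f values closest to the median."""
    grads = [np.asarray(g, dtype=float) for g in grads]
    n = len(grads)
    assert n >= 4 * f + 3, "this sketch assumes n >= 4f + 3 workers"

    remaining, selected = list(grads), []
    for _ in range(n - 2 * f):
        selected.append(remaining.pop(krum_select(remaining, f)))

    sel = np.stack(selected)                          # shape (n - 2f, d)
    median = np.median(sel, axis=0)
    order = np.argsort(np.abs(sel - median), axis=0)  # per-coordinate ranking
    closest = np.take_along_axis(sel, order[: n - 4 * f], axis=0)
    return closest.mean(axis=0)
```

The coordinate-wise trimming is the step intended to shrink the attacker's per-coordinate leeway toward the $O(1/\sqrt{d})$ regime discussed above.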

Fri 13 July 7:20 - 7:40 PDT

Asynchronous Byzantine Machine Learning (the case of SGD)

Georgios Damaskinos · El Mahdi El Mhamdi · Rachid Guerraoui · Rhicheek Patra · Mahsa Taziki

Asynchronous distributed machine learning solutions have proven very effective so far, but always assuming perfectly functioning workers. In practice, some of the workers can however exhibit Byzantine behavior, caused by hardware failures, software bugs, corrupt data, or even malicious attacks. We introduce Kardam, the first distributed asynchronous stochastic gradient descent (SGD) algorithm that copes with Byzantine workers. Kardam consists of two complementary components: a filtering and a dampening component. The first is scalar-based and ensures resilience when up to a third of the workers are Byzantine. Essentially, this filter leverages the Lipschitzness of cost functions and acts as a self-stabilizer against Byzantine workers that would attempt to corrupt the progress of SGD. The dampening component bounds the convergence rate by adjusting to stale information through a generic gradient weighting scheme. We prove that Kardam guarantees almost sure convergence in the presence of asynchrony and Byzantine behavior, and we derive its convergence rate. We evaluate Kardam on the CIFAR-100 and EMNIST datasets and measure its overhead with respect to non-Byzantine-resilient solutions. We empirically show that Kardam does not introduce additional noise to the learning procedure but does induce a slowdown (the cost of Byzantine resilience) that we both theoretically and empirically show to be less than f/n, where f is the number of Byzantine failures tolerated and n the total number of workers. Interestingly, we also observe that the dampening component is useful in its own right, as it enables building an SGD algorithm that outperforms alternative staleness-aware asynchronous competitors in environments with honest workers.
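As a rough illustration of the two components, the sketch below pairs a simplified empirical-Lipschitz filter with a 1 / (1 + staleness) dampening factor. The quantile threshold, history handling, and weighting function are illustrative stand-ins, not Kardam's exact filter or dampening scheme.

```python
import numpy as np

class LipschitzFilterSketch:
    """Simplified sketch of a scalar Lipschitz filter: accept a worker's
    gradient only if its empirical Lipschitz coefficient
        k = ||g_t - g_prev|| / ||x_t - x_prev||
    does not exceed a high quantile of the coefficients seen so far."""

    def __init__(self, quantile=2 / 3):
        self.quantile = quantile      # illustrative threshold, not Kardam's rule
        self.history = []

    def accept(self, grad, prev_grad, params, prev_params):
        step = np.linalg.norm(params - prev_params)
        if step == 0:
            return True
        k = np.linalg.norm(grad - prev_grad) / step
        ok = (not self.history) or k <= np.quantile(self.history, self.quantile)
        self.history.append(k)
        return ok

def dampened_update(params, grad, staleness, lr=0.1):
    """Staleness-aware dampening: scale the learning rate by 1 / (1 + staleness),
    one simple instance of a generic gradient weighting scheme."""
    return params - lr / (1.0 + staleness) * grad
```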

Fri 13 July 7:40 - 7:50 PDT

DRACO: Byzantine-resilient Distributed Training via Redundant Gradients

Lingjiao Chen · Hongyi Wang · Zachary Charles · Dimitris Papailiopoulos

Distributed model training is vulnerable to Byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To guarantee some form of robustness, recent work suggests using variants of the geometric median as an aggregation rule, in place of gradient averaging. Unfortunately, median-based rules can incur a prohibitive computational overhead in large-scale settings, and their convergence guarantees often require strong assumptions. In this work, we present DRACO, a scalable framework for robust distributed training that uses ideas from coding theory. In DRACO, each compute node evaluates redundant gradients that are used by the parameter server to eliminate the effects of adversarial updates. DRACO comes with problem-independent robustness guarantees, and the model that it trains is identical to the one trained in the adversary-free setup. We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times to orders of magnitude faster than median-based approaches.
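The repetition-code instance of this idea is easy to sketch: if every node in a group of 2f + 1 evaluates the same redundant chunk gradient, the parameter server can recover the honest value by majority vote, with no robust statistics involved. The group layout and exact-match vote below are illustrative; DRACO's cyclic-code variant and encoding details are omitted.

```python
import numpy as np

def repetition_decode(group_outputs):
    """Sketch of a repetition-code decoder: every node in a group of size
    2f + 1 was assigned the same data chunk, so at least f + 1 honest nodes
    agree on the chunk gradient and the PS can take a majority vote."""
    outputs = [np.asarray(g) for g in group_outputs]
    f = (len(outputs) - 1) // 2
    for g in outputs:
        if sum(np.array_equal(g, h) for h in outputs) >= f + 1:
            return g
    raise ValueError("no majority: more than f adversarial nodes in this group")

def aggregate_groups(groups):
    """Average the decoded chunk gradients over all groups (hypothetical
    layout: `groups` is a list of per-group lists of node outputs)."""
    return np.mean([repetition_decode(g) for g in groups], axis=0)
```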

Fri 13 July 7:50 - 8:00 PDT

Communication-Computation Efficient Gradient Coding

Min Ye · Emmanuel Abbe

This paper develops coding techniques to reduce the running time of distributed learning tasks. It characterizes the fundamental tradeoff for computing gradients in terms of three parameters: computation load, straggler tolerance, and communication cost. It further gives an explicit coding scheme that achieves the optimal tradeoff based on recursive polynomial constructions, coding both across data subsets and vector components. As a result, the proposed scheme makes it possible to minimize the running time of gradient computations. Implementations are made on Amazon EC2 clusters using Python with the mpi4py package. Results show that the proposed scheme maintains the same generalization error while reducing the running time by $32\%$ compared to uncoded schemes and $23\%$ compared to prior coded schemes focusing only on stragglers (Tandon et al., ICML 2017).
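For context, the sketch below shows the simplest straggler-only baseline this line of work builds on, a fractional-repetition gradient code in the spirit of Tandon et al.: each group of s + 1 workers replicates the same s + 1 data parts, so the master tolerates any s stragglers. The paper's recursive polynomial construction, which additionally codes across vector components to reduce communication, is not reproduced here; the layout below is an illustrative assumption.

```python
def fractional_repetition_assignment(n_workers, s):
    """Split n workers into n / (s + 1) groups; every worker in a group
    computes the gradient sum of the same s + 1 data parts, so the master
    only needs one response per group (tolerating any s stragglers)."""
    assert n_workers % (s + 1) == 0, "this sketch assumes (s + 1) divides n"
    n_groups = n_workers // (s + 1)
    parts_per_group = [list(range(g * (s + 1), (g + 1) * (s + 1)))
                       for g in range(n_groups)]
    # worker w belongs to group w // (s + 1) and computes that group's parts
    return {w: parts_per_group[w // (s + 1)] for w in range(n_workers)}

def decode(group_sums_received, n_groups):
    """Master-side decoding: the full gradient is the sum over groups of any
    one received partial sum per group."""
    assert all(g in group_sums_received for g in range(n_groups)), \
        "an entire group straggled"
    return sum(group_sums_received[g] for g in range(n_groups))
```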