

Poster

D2: Decentralized Training over Decentralized Data

Hanlin Tang · Xiangru Lian · Ming Yan · Ce Zhang · Ji Liu

Hall B #207

Abstract: When training a machine learning model with multiple workers, each of which collects data from its own data source, it is useful for the data collected by different workers to be {\em unique} and {\em different}. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are {\em not too different}. In this paper, we ask the question: {\em Can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers?} We present D2, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, ``decentralized'' data). The core of D2 is a variance reduction extension of D-PSGD. It improves the convergence rate from $O\left(\frac{\sigma}{\sqrt{nT}} + \frac{(n\zeta^2)^{1/3}}{T^{2/3}}\right)$ to $O\left(\frac{\sigma}{\sqrt{nT}}\right)$, where $\zeta^2$ denotes the variance among data on different workers. As a result, D2 is robust to data variance among workers. We empirically evaluate D2 on image classification tasks, where each worker has access to data of only a limited set of labels, and find that D2 significantly outperforms D-PSGD.
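To make the idea concrete, below is a minimal, hedged sketch of a variance-reduced decentralized SGD round in the spirit of D2, written in NumPy. The function name `d2_style_step`, the ring mixing matrix `W`, the first-step initialization, and the quadratic toy problem are illustrative assumptions and not the paper's implementation; the exact update rule in D2 may differ in details.

```python
# Hedged sketch (not the paper's code): a decentralized SGD round with a
# D2-style correction term. Each worker mixes with its neighbors through a
# symmetric, doubly stochastic matrix W and reuses its previous iterate and
# stochastic gradient, which is intended to cancel the cross-worker data
# variance (the zeta^2 term in D-PSGD's rate described in the abstract).

import numpy as np

def d2_style_step(X, X_prev, grads, grads_prev, W, lr):
    """One synchronized round for all n workers.

    X, X_prev         : (n, d) current and previous local models
    grads, grads_prev : (n, d) stochastic gradients evaluated at X and X_prev
    W                 : (n, n) symmetric, doubly stochastic mixing matrix
    lr                : step size
    Returns the new (n, d) local models.
    """
    # Local update: an SGD step plus a correction built from the previous
    # iterate and gradient.
    half = 2.0 * X - X_prev - lr * (grads - grads_prev)
    # Gossip averaging with neighbors according to W.
    return W @ half

# Tiny usage example: 4 workers on a ring, each with a different local
# quadratic f_i(x) = 0.5 * ||x - target_i||^2, so the data on different
# workers are deliberately very different.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, lr = 4, 5, 0.1
    W = np.array([[0.5, 0.25, 0.0, 0.25],
                  [0.25, 0.5, 0.25, 0.0],
                  [0.0, 0.25, 0.5, 0.25],
                  [0.25, 0.0, 0.25, 0.5]])
    targets = rng.normal(size=(n, d))
    X_prev = rng.normal(size=(n, d))
    g_prev = X_prev - targets                  # gradient of each local quadratic
    X = W @ (X_prev - lr * g_prev)             # plain D-PSGD-style first step (assumed)
    for _ in range(200):
        g = X - targets
        X, X_prev, g_prev = d2_style_step(X, X_prev, g, g_prev, W, lr), X, g
    print("consensus model:", X.mean(axis=0))  # workers agree near mean(targets)
```

In this toy run the workers reach consensus near the minimizer of the average objective even though every worker sees entirely different data, which is the qualitative behavior the abstract claims for D2 over D-PSGD.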
