ICML 2021 Accelerating Gossip SGD with Periodic Global Averaging Spotlight

Accelerating Gossip SGD with Periodic Global Averaging

Yiming Chen · Kun Yuan · Yingya Zhang · Pan Pan · Yinghui Xu · Wotao Yin

Abstract: Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is inversely proportional to the quantity $1-\beta$, which measures the network connectivity. On large and sparse networks where $1-\beta \to 0$, Gossip SGD requires more iterations to converge, which offsets its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging to accelerate Gossip SGD. Its transient stage, i.e., the number of iterations required to reach the asymptotic linear speedup stage, improves from $\Omega(\beta^4 n^3/(1-\beta)^4)$ to $\Omega(\beta^4 n^3 H^4)$ for non-convex problems. The influence of network topology in Gossip-PGA can be controlled by the averaging period $H$. Its transient-stage complexity is also superior to that of local SGD, which has order $\Omega(n^3 H^4)$. Empirical results of large-scale training on image classification (ResNet50) and language modeling (BERT) validate our theoretical findings.
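
The idea in the abstract (gossip averaging on most iterations, with a global average every $H$ iterations) can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the ring topology, the toy quadratic objective, and all function names (`ring_mixing_matrix`, `connectivity_beta`, `make_quadratic_grad`, `gossip_pga`) are hypothetical choices made for illustration. It also assumes that $\beta$ can be taken as the second-largest eigenvalue magnitude of a symmetric, doubly stochastic mixing matrix; the paper's exact definition and step ordering may differ.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic mixing matrix for a ring (illustrative choice).

    Each node puts weight 1/3 on itself and on each of its two ring neighbors.
    """
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W


def connectivity_beta(W):
    """Second-largest eigenvalue magnitude of a symmetric mixing matrix.

    Assumption: this is used here as the connectivity measure beta.
    """
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))
    return magnitudes[-2]


def make_quadratic_grad(centers, noise_std):
    """Stochastic gradient of the toy local loss f_i(x) = 0.5 * ||x - centers[i]||^2."""
    def grad_fn(i, x, rng):
        return x - centers[i] + noise_std * rng.normal(size=x.shape)
    return grad_fn


def gossip_pga(W, grad_fn, x0, lr, num_iters, H, rng):
    """Simulate gossip SGD with a global average every H iterations.

    Row i of X is the model copy held by node i.
    """
    n = W.shape[0]
    X = np.tile(x0, (n, 1))
    for k in range(1, num_iters + 1):
        # Local stochastic gradient step on every node.
        G = np.stack([grad_fn(i, X[i], rng) for i in range(n)])
        X_half = X - lr * G
        if k % H == 0:
            # Periodic global averaging: every node receives the exact mean.
            X = np.tile(X_half.mean(axis=0), (n, 1))
        else:
            # Gossip step: every node averages with its neighbors only.
            X = W @ X_half
    return X


n, dim = 16, 3
rng = np.random.default_rng(0)
W = ring_mixing_matrix(n)
centers = rng.normal(size=(n, dim))
grad_fn = make_quadratic_grad(centers, noise_std=0.1)
X = gossip_pga(W, grad_fn, np.zeros(dim), lr=0.1, num_iters=200, H=10, rng=rng)

print("beta for the ring:", connectivity_beta(W))
print("distance of node 0 from the global optimum:",
      np.linalg.norm(X[0] - centers.mean(axis=0)))
```

In this sketch, setting `H` larger than `num_iters` means the global-averaging branch never runs, leaving plain gossip SGD, while `H=1` averages globally on every iteration as parallel SGD does. In a real multi-worker setup the global-averaging branch would be an all-reduce across workers rather than a NumPy mean.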