ICML Poster Double Stochasticity Gazes Faster: Snap-Shot Decentralized Stochastic Gradient Tracking Methods

Hao Di · Haishan Ye · Xiangyu Chang · Guang Dai · Ivor Tsang

Hall C 4-9 #2711

Abstract: In decentralized optimization, $m$ agents form a network and communicate only with their neighbors, which gives advantages in data ownership, privacy, and scalability. At the same time, decentralized stochastic gradient descent ($\texttt{SGD}$) methods, as popular decentralized algorithms for training large-scale machine learning models, have shown their superiority over centralized counterparts. Distributed stochastic gradient tracking ($\texttt{DSGT}$) has been recognized as the popular and state-of-the-art decentralized $\texttt{SGD}$ method due to its proper theoretical guarantees. However, the theoretical analysis of $\texttt{DSGT}$ shows that its iteration complexity is $\tilde{\mathcal{O}} \left(\frac{\bar{\sigma}^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bar{\sigma}}{\mu(1 - \lambda_2(W))^{1/2} C_W \sqrt{\varepsilon}}\right)$, where the doubly stochastic matrix $W$ represents the network topology and $C_W$ is a parameter that depends on $W$. This indicates that the convergence of $\texttt{DSGT}$ is heavily affected by the topology of the communication network. To overcome this weakness of $\texttt{DSGT}$, we resort to the snap-shot gradient tracking technique and propose two novel algorithms, snap-shot $\texttt{DSGT}$ ($\texttt{SS-DSGT}$) and accelerated snap-shot $\texttt{DSGT}$ ($\texttt{ASS-DSGT}$). We further show that $\texttt{SS-DSGT}$ exhibits a lower iteration complexity than $\texttt{DSGT}$ in the general communication network topology. Additionally, $\texttt{ASS-DSGT}$ matches $\texttt{DSGT}$'s iteration complexity $\mathcal{O}\left(\frac{\bar{\sigma}^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bar{\sigma}}{\mu (1 - \lambda_2(W))^{1/2}\sqrt{\varepsilon}}\right)$ under the same conditions as $\texttt{DSGT}$. Numerical experiments validate $\texttt{SS-DSGT}$'s superior performance in the general communication network topology and show better practical performance of $\texttt{ASS-DSGT}$ on the specified $W$ compared to $\texttt{DSGT}$.
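To make the roles of $W$ and $1 - \lambda_2(W)$ in the bounds above more concrete, the following is a minimal sketch of the textbook gradient tracking recursion on a toy problem. It is not the paper's $\texttt{SS-DSGT}$ or $\texttt{ASS-DSGT}$, whose details the abstract does not give. The ring topology, the uniform 1/3 weights, the scalar quadratic objectives, and the names `ring_mixing_matrix`, `spectral_gap`, and `dsgt` are all assumptions chosen for illustration.

```python
import numpy as np


def ring_mixing_matrix(m):
    """Symmetric doubly stochastic W for a ring of m >= 3 agents.

    Assumed weighting: 1/3 on each agent itself and on its two neighbors.
    """
    W = np.zeros((m, m))
    for i in range(m):
        for j in (i - 1, i, i + 1):
            W[i, j % m] = 1.0 / 3.0
    return W


def spectral_gap(W):
    """Return 1 - lambda_2(W) for a symmetric doubly stochastic W."""
    eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]  # descending order
    return 1.0 - eigvals[1]


def dsgt(W, targets, step=0.05, noise=0.1, iters=2000, seed=0):
    """Textbook gradient tracking on f_i(x) = 0.5 * (x - targets[i])**2.

    Agent i holds a scalar x[i]; its stochastic gradient is the exact
    gradient plus Gaussian noise (an assumed noise model).
    """
    rng = np.random.default_rng(seed)
    m = len(targets)

    def stoch_grad(x):
        return (x - targets) + noise * rng.standard_normal(m)

    x = np.zeros(m)
    g = stoch_grad(x)
    y = g.copy()  # tracker starts at the local stochastic gradients
    for _ in range(iters):
        x = W @ (x - step * y)   # local step, then average with neighbors
        g_new = stoch_grad(x)
        y = W @ y + g_new - g    # track the network-average gradient
        g = g_new
    return x
```

For example, `spectral_gap(ring_mixing_matrix(8))` is roughly 0.195, and `dsgt(ring_mixing_matrix(8), np.arange(8, dtype=float))` returns per-agent iterates that cluster near the minimizer of the average objective, the mean of the targets (3.5). Larger rings give a smaller gap, which is the topology dependence the complexity bounds express.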