Rethinking Convergence in MoE Training: The Role of Routing Sparsity
Weihao Zhu ⋅ Long Shi ⋅ Kang Wei ⋅ Zhe Wang ⋅ Yipeng Zhou ⋅ Haixia Zhang
Abstract
In Mixture-of-Experts (MoE) training, sparse routing, i.e., activating only the top-$K$ experts per token, is essential for balancing convergence speed and computational cost. However, existing works typically choose $K$ empirically, without theoretical guidance. To address this gap, we characterize the convergence behavior of MoE training using stochastic optimization theory. Specifically, we derive a convergence upper bound of $\mathcal{O}\left(\frac{1+M/K}{\sqrt{T}}\right)$, where $T$ is the number of training iterations and $M$ is the total number of experts per MoE layer. This result guarantees convergence and shows that increasing $K$ can accelerate training. By further fixing the total computational budget $R$ (in FLOPs), we obtain a refined bound of $\mathcal{O}\left(\sqrt{\frac{K}{R}} + \frac{M}{\sqrt{K R}}\right)$, which is convex in $K$ and implies the existence of an optimal $K^{*}\in[1,M]$ that achieves the best convergence performance. Extensive experiments validate our theoretical analysis under diverse settings.
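As a brief illustration (a sketch, not part of the abstract itself): if the $\mathcal{O}(\cdot)$ in the budget-constrained bound hides positive constants, say $c_1$ and $c_2$ (our notation, not the paper's), then the first-order condition locates the stationary point of the bound in $K$:

\[
f(K) \;=\; c_1\sqrt{\frac{K}{R}} \;+\; c_2\,\frac{M}{\sqrt{K R}}, \qquad
f'(K) \;=\; \frac{c_1}{2\sqrt{K R}} \;-\; \frac{c_2\,M}{2K\sqrt{K R}} \;=\; 0
\;\;\Longrightarrow\;\;
K^{*} \;=\; \frac{c_2}{c_1}\,M .
\]

Projecting this stationary point onto $[1, M]$ gives the optimal sparsity level referenced in the abstract; the precise constants and the resulting $K^{*}$ depend on the problem parameters developed in the full analysis.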