DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Can Jin ⋅ Hongwu Peng ⋅ Mingcan Xiang ⋅ Qixin Zhang ⋅ Xiangchi Yuan ⋅ Amit Hasan ⋅ Ohi Dibua ⋅ Yifan Gong ⋅ Yan Kang ⋅ Dimitris Metaxas
Abstract
Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-$k$ routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. While Top-$p$ routing offers a flexible alternative, we demonstrate that existing naive Top-$p$ implementations with fixed global probability thresholds provide only marginal gains over Top-$k$, suffer from hyperparameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose $\texttt{DTop-}p$, a sparsity-controllable dynamic routing mechanism. To overcome the non-differentiability of the MoE sparsity level with respect to the Top-$p$ threshold, we use a Proportional-Integral controller that dynamically learns the Top-$p$ probability threshold to align the running sparsity with a user-defined budget. Furthermore, we introduce dynamic routing normalization to adaptively rescale logits, enabling distinct expert selection patterns across layers under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that $\texttt{DTop-}p$ consistently outperforms both Top-$k$ and fixed Top-$p$ baselines while matching the average FLOPs of Top-$k$ MoE. Our analysis confirms that $\texttt{DTop-}p$ exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.
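To make the routing idea concrete, the sketch below illustrates cumulative-probability (Top-$p$) expert selection combined with a Proportional-Integral controller that nudges the threshold toward a target experts-per-token budget. This is a minimal NumPy illustration written from the abstract alone: the function and class names (`top_p_select`, `SparsityPIController`), the controller gains, the initial threshold, and the clipping range are all illustrative assumptions, not the paper's actual formulation or hyperparameters.

```python
import numpy as np

def top_p_select(router_probs, p):
    """Select, per token, the smallest set of experts whose cumulative probability reaches p.

    router_probs: (num_tokens, num_experts) softmax router outputs.
    Returns a boolean mask of selected experts.
    """
    order = np.argsort(-router_probs, axis=-1)                      # experts sorted by probability
    sorted_probs = np.take_along_axis(router_probs, order, axis=-1)
    cumulative = np.cumsum(sorted_probs, axis=-1)
    # Keep each expert whose *exclusive* cumulative mass is still below p,
    # so the first expert is always kept and selection stops once p is reached.
    keep_sorted = (cumulative - sorted_probs) < p
    mask = np.zeros_like(router_probs, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask

class SparsityPIController:
    """Hypothetical PI controller steering the Top-p threshold toward a sparsity budget."""

    def __init__(self, target_experts, p_init=0.5, k_p=0.01, k_i=0.001):
        self.target = target_experts   # user-defined budget (average experts per token)
        self.p = p_init                # current Top-p threshold
        self.k_p = k_p                 # proportional gain (illustrative value)
        self.k_i = k_i                 # integral gain (illustrative value)
        self.integral = 0.0            # accumulated sparsity error

    def update(self, observed_experts_per_token):
        # If too many experts fire on average, lower p; if too few, raise it.
        error = observed_experts_per_token - self.target
        self.integral += error
        self.p -= self.k_p * error + self.k_i * self.integral
        self.p = float(np.clip(self.p, 0.05, 0.999))
        return self.p

# Toy usage: 16 tokens routed over 8 experts, targeting ~2 experts per token on average.
controller = SparsityPIController(target_experts=2.0)
probs = np.random.dirichlet(np.ones(8), size=16)
mask = top_p_select(probs, controller.p)
controller.update(mask.sum(axis=-1).mean())
```

Under this reading, the threshold is treated as a control variable updated from the running sparsity signal rather than learned by backpropagation, which is one way to sidestep the non-differentiability the abstract mentions.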