SIMoE: A Probabilistic Framework for Cardinality-Constrained Routing in Mixture-of-Experts
Heng Zhao ⋅ Zilei Shao ⋅ Guy Van den Broeck ⋅ Zhe Zeng
Abstract
Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token, but standard deterministic top-$k$ routing is non-differentiable and is trained with surrogate gradients that ignore the discrete expert selection performed at inference. We introduce SIMoE routing, which models expert selection as a stochastic latent variable and casts routing as probabilistic inference over discrete expert subsets under explicit cardinality constraints. First, we propose SIMoE Exact-$k$ routing, which samples discrete $k$-expert subsets and propagates gradients through tractable inclusion probabilities. We then extend this to SIMoE Dynamic-$k$ routing, in which both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive per-token expert allocation. Across benchmarks on OLMoE and Qwen MoE backbones, SIMoE Exact-$k$ improves expert utilization and routing diversity over competitive baselines, and SIMoE Dynamic-$k$ achieves comparable performance while activating fewer experts.
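The abstract's central mechanism, sampling discrete $k$-expert subsets while propagating gradients through tractable inclusion probabilities, can be made concrete with a small sketch. The sketch below assumes a conditional Poisson (size-constrained Bernoulli) subset distribution, where $P(S) \propto \prod_{i \in S} w_i$ subject to $|S| = k$; under that assumption, inclusion probabilities reduce to ratios of elementary symmetric polynomials. The distribution choice, the function names (`elem_sym`, `inclusion_probs`, `sample_exact_k`), and the NumPy setting are illustrative assumptions, not a transcription of the SIMoE implementation.

```python
import numpy as np

def elem_sym(w, k):
    """Elementary symmetric polynomials e_0..e_k of the weights w,
    via the standard O(len(w) * k) dynamic program."""
    e = np.zeros(k + 1)
    e[0] = 1.0
    for wi in w:
        # Update in reverse so each weight contributes at most once.
        for j in range(k, 0, -1):
            e[j] += wi * e[j - 1]
    return e

def inclusion_probs(w, k):
    """Inclusion probabilities pi_i = P(i in S) for the size-k subset
    distribution P(S) proportional to prod_{i in S} w_i:
        pi_i = w_i * e_{k-1}(w without i) / e_k(w).
    Each pi_i is a smooth function of w, hence of the router logits."""
    n = len(w)
    ek = elem_sym(w, k)[k]
    pi = np.empty(n)
    for i in range(n):
        e_minus = elem_sym(np.delete(w, i), k - 1)
        pi[i] = w[i] * e_minus[k - 1] / ek
    return pi

def sample_exact_k(w, k, rng):
    """Draw an exact-k subset by deciding each expert in turn with its
    correct conditional probability, using the recursion
    e_r(w[i:]) = w_i * e_{r-1}(w[i+1:]) + e_r(w[i+1:])."""
    chosen, need = [], k
    for i in range(len(w)):
        if need == 0:
            break
        e_rest = elem_sym(w[i + 1:], need)   # polynomials of the remaining weights
        e_with = elem_sym(w[i:], need)
        p = w[i] * e_rest[need - 1] / e_with[need]
        if rng.random() < p:
            chosen.append(i)
            need -= 1
    return chosen

# Usage: 8 experts, exactly 2 active per token.
rng = np.random.default_rng(0)
logits = rng.normal(size=8)       # hypothetical router logits
w = np.exp(logits)                # positive expert weights
k = 2
pi = inclusion_probs(w, k)
assert np.isclose(pi.sum(), k)    # inclusion probabilities sum to k
S = sample_exact_k(w, k, rng)     # a discrete 2-expert subset
```

Because each $\pi_i$ is differentiable in the weights, an autodiff framework can backpropagate through the inclusion probabilities even though the sampled subset itself is discrete, which is the property the abstract attributes to Exact-$k$ routing.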