Less Token, More Signal: MoE Expert Pruning via Critical Token Selection
Zeliang Zong ⋅ Kai Zhang ⋅ Yarong Wang ⋅ Wenming Tan ⋅ Ye Ren ⋅ Jilin Hu
Abstract
Mixture-of-Experts (MoE) architectures provide strong scalability for large language models, but their large expert parameter footprint poses challenges for efficient deployment. Expert pruning is widely used to reduce model size and inference cost; however, existing approaches are token-agnostic, treating all tokens equally when estimating expert importance. This uniform treatment dilutes the contributions of informative tokens and leads to suboptimal pruning decisions. To address this fundamental limitation, we propose **Step** (**S**elective **T**oken-guided **E**xpert **P**runing), a token-aware framework that rethinks expert pruning from the perspective of selective token guidance. By incorporating loss-aware expert evaluation and a lightweight knowledge-preserving mechanism, **Step** reduces information loss while removing redundant experts. Extensive experiments across different MoE architectures and model scales demonstrate the effectiveness of **Step**. On the 30B Qwen3 MoE model with 50\% expert sparsity, **Step** achieves nearly a 50\% reduction in memory usage with minimal performance degradation, delivers a 1.5$\times$ throughput improvement, and completes the entire pruning process within 10 minutes.
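The abstract does not include code, but the core idea of token-aware, loss-aware expert scoring can be illustrated with a short sketch. The PyTorch snippet below is a hypothetical reconstruction, not the authors' implementation: it ranks tokens by per-token loss, keeps only a top fraction as "critical" tokens, accumulates routing mass per expert over those tokens, and keeps the highest-scoring experts. The function names, the `top_p` token fraction, and the use of softmax routing mass as the importance signal are all assumptions for illustration.

```python
import torch

def token_aware_expert_scores(router_logits, token_losses, top_p=0.2):
    """Score experts using only the most informative ("critical") tokens.

    router_logits: (num_tokens, num_experts) pre-softmax gating scores
    token_losses:  (num_tokens,) per-token LM loss, used to rank informativeness
    top_p:         fraction of tokens kept as critical (hypothetical knob)
    """
    num_tokens, _ = router_logits.shape
    # Keep the top-p fraction of tokens by loss: these carry the most signal.
    k = max(1, int(top_p * num_tokens))
    critical = torch.topk(token_losses, k).indices
    # Accumulate softmax routing mass per expert over critical tokens only,
    # so uninformative tokens cannot dilute the importance estimate.
    gate = torch.softmax(router_logits[critical], dim=-1)  # (k, num_experts)
    return gate.sum(dim=0)  # (num_experts,), higher = more important

def experts_to_keep(scores, sparsity=0.5):
    """Return indices of experts retained at the given expert sparsity."""
    num_keep = int(scores.numel() * (1.0 - sparsity))
    return torch.topk(scores, num_keep).indices
```

At 50\% expert sparsity, `experts_to_keep` retains half the experts per MoE layer; the paper's loss-aware evaluation and knowledge-preserving mechanism would refine this basic scoring, which is only meant to convey the selective-token principle.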