Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression
Yifu Ding ⋅ Jiacheng Wang ⋅ Ge Yang ⋅ Yongcheng Jing ⋅ Jinyang Guo ⋅ Xianglong Liu ⋅ Dacheng Tao
Abstract
Mixture-of-Experts (MoE) models scale compute efficiently, yet they remain expensive to deploy due to their substantial memory footprint and inference overhead. Prior methods mainly operate at the expert level, either removing whole experts or ranking experts by importance. However, such expert-wise decisions are too coarse to identify redundancy, and often misallocate pruning budgets and limit compression. This issue worsens in large MoEs with dynamic routing and heterogeneous experts. To alleviate this dilemma, we observe, for the first time, that the information in MoE experts is highly concentrated in a few channels, leaving substantial redundancy even in "high-importance" experts. Accordingly, we propose a structural pruning framework tailored for MoEs, reformulating the pruning-ratio objective as maximizing channel-score coverage via an efficient attribution-based approximation. Experiments on DeepSeek and Qwen MoEs show that our method retains accuracy under 50\% pruning, or 25\% pruning jointly with 4-bit quantization, reducing the memory footprint of Qwen3-30B-A3B by 5.27$\times$ and outperforming state-of-the-art baselines across diverse benchmarks.
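To make the coverage idea concrete, here is a minimal sketch of what "maximizing channel-score coverage" under a global pruning budget could look like. It assumes non-negative per-channel attribution scores are already available for each expert; the function name, the greedy top-score selection, and all parameters are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def coverage_pruning_masks(channel_scores, keep_ratio=0.5):
    """Hypothetical sketch: select channels across ALL experts so that the
    retained channels maximize total attribution-score coverage under one
    global budget, instead of pruning each expert at a uniform ratio.

    channel_scores: list of 1-D non-negative arrays, one per expert
                    (e.g., gradient-times-activation attribution scores).
    Returns one boolean keep-mask per expert.
    """
    # Pool all (expert, channel) scores so the budget is allocated globally;
    # experts whose information is spread over more channels keep more of them.
    flat = np.concatenate(channel_scores)
    budget = max(1, int(keep_ratio * flat.size))
    # For a fixed budget, keeping the globally top-scoring channels maximizes
    # coverage, i.e. the sum of retained scores (ties may keep a few extra).
    threshold = np.partition(flat, -budget)[-budget]
    return [scores >= threshold for scores in channel_scores]

# Toy usage: 8 experts with skewed scores, mimicking the observation that
# information concentrates in a few channels per expert.
rng = np.random.default_rng(0)
scores = [rng.gamma(0.5, size=1024) for _ in range(8)]
masks = coverage_pruning_masks(scores, keep_ratio=0.5)
```

Under this toy setup, the kept fraction varies across experts even though the global budget is 50\%, which is the behavior the abstract contrasts with coarse expert-wise decisions.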