Mining Tensor/Neuron-Level Sparsity to Maximize Mixture-of-Experts Potential in Post-Training and Inference
Weilin Cai ⋅ Le Qin ⋅ Shwai He ⋅ Junwei Cui ⋅ Ang Li ⋅ Jiayi Huang
Abstract
Mixture of Experts (MoE) has emerged as a mainstream architecture for Large Language Models (LLMs), balancing computational efficiency with model scalability. While prior work has explored increasing tensor-level sparsity via finer-grained expert configurations during pre-training, we identify significant unexploited sparsity at both the tensor and neuron levels during post-training and inference. To leverage this, we propose complete expert partition for post-training and threshold-based token-expert dropping for inference. These techniques improve the Mixtral-8$\times$7B model's average accuracy by 1% across nine downstream benchmarks (notably 4% on GSM8K). To further optimize the accuracy-efficiency trade-off for inference, we introduce dual-threshold token-expert dropping with partial expert partition and reconstruction. Our approach yields a 1.19$\times$ MoE speedup and a 0.5% accuracy gain on Mixtral-8$\times$7B when combining post-training and inference optimizations. For inference-only optimization on OLMoE-Instruct and DeepSeek-V2-Lite-Chat, we achieve up to 1.41$\times$ MoE speedup with a negligible accuracy loss ($<$0.5%).
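To make the idea of threshold-based token-expert dropping concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a Mixtral-style router (softmax over experts, top-k selection, renormalization over the selected experts) and drops any token-expert pair whose renormalized gate weight falls below a threshold. The function name `route_with_threshold`, the default `threshold=0.1`, and the choice to always keep the top-1 expert are illustrative assumptions that do not come from the abstract.

```python
import numpy as np

def route_with_threshold(router_logits, top_k=2, threshold=0.1):
    """Hypothetical sketch of threshold-based token-expert dropping.

    router_logits: array of shape (num_tokens, num_experts).
    Returns (expert_ids, weights, keep_mask), each of shape (num_tokens, top_k).
    keep_mask[t, j] is False when the j-th selected expert can be skipped for token t.
    """
    logits = np.asarray(router_logits, dtype=np.float64)

    # Numerically stable softmax over the expert dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Top-k experts per token, highest probability first (stable sort for ties).
    expert_ids = np.argsort(-probs, axis=-1, kind="stable")[:, :top_k]
    weights = np.take_along_axis(probs, expert_ids, axis=-1)

    # Renormalize over the selected experts (assumed, Mixtral-style).
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Drop token-expert pairs with a small gate weight.
    keep_mask = weights >= threshold
    keep_mask[:, 0] = True  # assumption: never drop the top-1 expert
    return expert_ids, weights, keep_mask


if __name__ == "__main__":
    logits = [[5.0, 0.0, 0.0, 0.0],   # one dominant expert
              [1.0, 1.0, 0.0, 0.0]]   # two equally weighted experts
    ids, w, keep = route_with_threshold(logits)
    print(keep)
    print("fraction of expert calls skipped:", 1.0 - keep.mean())
```

In this sketch the first token is routed almost entirely to one expert, so its second expert call is skipped, while the second token keeps both experts. The skipped calls are where an inference speedup would come from; how the remaining weights are rescaled after dropping is left open here.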