PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-Packed Inference
Yushu Zhao ⋅ Zheng Wang ⋅ Minjia Zhang
Abstract
Mixture-of-Experts (MoE) models have shown strong potential for scaling language models efficiently by activating only a small subset of experts per input. However, their deployment remains limited by the high memory overhead of storing all expert parameters, particularly as the number of experts increases. To address this challenge, prior works have explored expert dropping and merging strategies; however, these often suffer from notable performance drops, especially at high compression ratios, due to their reliance on coarse-grained tensor- or expert-level operations. In this paper, we introduce PuzzleMoE, the first MoE merging method to enable fine-grained element-wise merging while achieving both high accuracy and fast inference, via two key innovations. First, PuzzleMoE performs sparse expert merging by identifying element-wise weight redundancy and specialization, using a dual-mask approach to capture both shared and expert-specific salient parameters. Second, to avoid the overhead of storing masks and signs, we introduce a bit-packed encoding scheme that reuses underutilized exponent bits, enabling efficient MoE inference on GPUs. Extensive experiments demonstrate that PuzzleMoE outperforms prior MoE compression methods by up to 16.7\% on MMLU at a 50\% compression ratio, and achieves up to 1.80$\times$ end-to-end inference throughput gains.
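To make the two ideas concrete, below is a minimal, illustrative Python/NumPy sketch of (i) a dual-mask style element-wise merge of two expert weight tensors and (ii) packing a per-element mask bit into the unused top exponent bit of fp16 weights. The function names, the top-k magnitude salience rule, and the fp16 bit-layout choice are assumptions for illustration only, not the paper's exact algorithm or storage format.

```python
import numpy as np

def dual_mask_merge(w_a, w_b, salient_frac=0.10):
    """Illustrative sketch: merge two expert weight tensors element-wise.

    Redundant positions are represented by the element-wise mean; each expert
    additionally keeps a sparse set of its own largest-magnitude ("salient")
    parameters, recorded by a per-expert binary mask. The top-k magnitude rule
    here is a hypothetical stand-in for the paper's salience criterion.
    """
    merged = 0.5 * (w_a + w_b)

    def top_mask(w):
        k = max(1, int(salient_frac * w.size))
        thr = np.partition(np.abs(w).ravel(), w.size - k)[w.size - k]
        return np.abs(w) >= thr

    mask_a, mask_b = top_mask(w_a), top_mask(w_b)
    # Each expert is reconstructed as: the merged values everywhere, except at
    # its own salient positions, where the original expert value is retained.
    return merged, (mask_a, w_a[mask_a]), (mask_b, w_b[mask_b])

def pack_mask_into_fp16(weights, mask):
    """Pack one mask bit per element into the top exponent bit of fp16 weights.

    For typical weight magnitudes (|w| < 2) that exponent bit is always zero,
    so it can carry the mask at no extra storage cost. Conceptual sketch of
    the exponent-bit reuse idea only; the actual encoding may differ.
    """
    bits = np.asarray(weights, dtype=np.float16).view(np.uint16).copy()
    assert np.all((bits & 0x4000) == 0), "assumes |w| < 2, so bit 14 is free"
    return bits | (mask.astype(np.uint16) << 14)  # bit 14 = exponent MSB

def unpack_fp16(packed):
    """Recover the mask and the fp16 weights from the packed 16-bit words."""
    mask = ((packed >> 14) & 1).astype(bool)
    weights = (packed & np.uint16(0xBFFF)).view(np.float16)
    return weights, mask
```

Because the packed mask can be recovered with a couple of bitwise operations per element, a decoding step like `unpack_fp16` can in principle be fused into the expert matrix multiply, which is one way such a scheme can remain GPU-friendly.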