RaGEP: Rank-aware Geometric Expert Pruning for Mixture-of-Experts Language Models
Abstract
Sparse Mixture-of-Experts (MoE) architectures scale model capacity efficiently but suffer from massive static parameter footprints, creating significant deployment burdens on memory-constrained hardware. Existing post-training pruning methods often rely on scalar statistics and ignore the representational geometry of expert feature spaces, which leads to suboptimal resource allocation across layers and the retention of redundant experts. To address this, we propose Rank-aware Geometric Expert Pruning (RaGEP), a framework that compresses MoE models by analyzing the geometric properties of expert activations. First, in the inter-layer allocation stage, we introduce a rank-aware budget allocation mechanism that adaptively assigns expert budgets based on the effective rank of layer-wise representations. Second, in the intra-layer selection stage, we propose a Spectral-Salience Pruning metric that harmonizes subspace orthogonality and activation magnitude to identify high-energy orthogonal experts. Extensive experiments on MoE models of different scales show that our method consistently outperforms state-of-the-art baselines on a diverse set of zero-shot tasks while reducing model size and inference cost. Code is available in the supplementary material.
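To make the two stages concrete, the sketch below illustrates one plausible instantiation of the quantities named in the abstract: an effective-rank estimate of a layer's activation matrix (computed here as the exponential of the entropy of its normalized singular values, a standard definition that may differ from the paper's exact estimator), a budget allocation proportional to those ranks, and a spectral-salience score that combines an expert's subspace orthogonality with its activation magnitude via a harmonic mean. The function names, the proportional allocation rule, and the specific combination rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def effective_rank(activations: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (tokens x hidden) activation matrix.

    Assumption: entropy-of-singular-values definition (exp of the Shannon
    entropy of the normalized spectrum); the paper's estimator may differ.
    """
    s = np.linalg.svd(activations, compute_uv=False)
    p = s / (s.sum() + eps)                      # normalized singular-value spectrum
    entropy = -np.sum(p * np.log(p + eps))       # Shannon entropy of the spectrum
    return float(np.exp(entropy))

def allocate_budgets(layer_ranks, total_keep: int, min_keep: int = 1) -> np.ndarray:
    """Assumption: per-layer expert budgets proportional to effective rank."""
    ranks = np.asarray(layer_ranks, dtype=float)
    raw = ranks / ranks.sum() * total_keep
    budgets = np.maximum(min_keep, np.floor(raw)).astype(int)
    # Hand out any leftover budget to layers with the largest fractional parts.
    remainder = total_keep - budgets.sum()
    if remainder > 0:
        order = np.argsort(-(raw - np.floor(raw)))
        budgets[order[:remainder]] += 1
    return budgets

def spectral_salience(expert_feats: np.ndarray,
                      other_feats: np.ndarray,
                      gate_scores: np.ndarray,
                      eps: float = 1e-12) -> float:
    """Illustrative salience score for one expert (higher = keep).

    Combines (harmonic mean, an assumption) two terms:
      - orthogonality: fraction of this expert's feature energy lying outside
        the subspace spanned by the remaining experts' features;
      - magnitude: average routing weight times average feature norm.
    """
    # Orthonormal basis of the other experts' pooled feature subspace in R^d.
    q, _ = np.linalg.qr(other_feats.T)
    # Energy of this expert's features that the other-expert subspace cannot explain.
    proj = q @ (q.T @ expert_feats.T)
    residual = expert_feats.T - proj
    orthogonality = np.linalg.norm(residual) / (np.linalg.norm(expert_feats) + eps)
    # Activation-magnitude term from the tokens routed to this expert.
    magnitude = float(gate_scores.mean()) * float(
        np.linalg.norm(expert_feats, axis=1).mean())
    # Harmonic mean rewards experts that are both orthogonal and high-energy.
    return 2.0 * orthogonality * magnitude / (orthogonality + magnitude + eps)
```

The harmonic-mean combination is one natural reading of "harmonizes": it suppresses experts that score high on only one of the two terms, so an expert survives pruning only if it is simultaneously non-redundant (orthogonal) and strongly activated. In practice the magnitude term would be normalized across experts within a layer before combining.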