Schur-A*: Layer-wise Optimal Expert Pruning for Sparse MoEs via Schur-Complement Guided A* Search
Abstract
Sparse Mixture-of-Experts (MoE) language models enable conditional computation but face a deployment bottleneck known as the "memory wall": although only a few experts are activated per token, the entire model must reside in memory. Existing expert pruning methods largely rank experts independently and therefore fail to account for the inter-dependencies and redundancies among experts. In this paper, we formulate post-training MoE pruning as a reconstruction-driven subset selection problem that minimizes layer-output distortion under a cardinality constraint. We introduce Schur-A*, an algorithm that uses A* search to find a provably optimal expert subset within each layer. To keep the search tractable, we derive a novel admissible heuristic from a Schur-complement-based relaxation of the reconstruction objective: a tight lower bound on the distortion attainable from any completion of a partial selection. The tightness of this bound allows aggressive pruning of the search space while preserving the guarantee of optimality. We further propose an automated knee-point-detection strategy that balances fidelity and memory reduction across heterogeneous layers. Extensive experiments on Qwen3-30B-A3B show that Schur-A* consistently outperforms greedy and ranking-based baselines, retaining performance close to the unpruned model even at aggressive pruning ratios.
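As a minimal formalization of the subset-selection problem the abstract describes (the notation below, including $F_\ell$, $X$, $\mathcal{E}_\ell$, and $k$, is illustrative rather than taken from the paper's body): let $F_\ell(X)$ be the original MoE layer output on calibration inputs $X$, and let $F_\ell^{(S)}(X)$ be the output when only the experts in $S$ are retained (with router weights renormalized over $S$, a common convention we assume here). The layer-wise problem is then

\min_{S \subseteq \mathcal{E}_\ell,\; |S| = k} \left\| F_\ell(X) - F_\ell^{(S)}(X) \right\|_F^2,

where $\mathcal{E}_\ell$ is the expert set of layer $\ell$ and $k$ is the retained-expert budget for that layer.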
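The search itself can be pictured as best-first A* over partial expert subsets. The following self-contained Python sketch shows the skeleton only: the toy error model (layer output modeled as the sum of per-expert contribution columns of a matrix C), the names astar_prune and subset_error, and the placeholder heuristic h ≡ 0 are all assumptions, not the paper's implementation; the paper's Schur-complement bound would be plugged in as h.

import heapq

import numpy as np


def subset_error(C, keep):
    # Toy distortion model: the full layer output is the sum of all expert
    # contribution columns of C; pruning to `keep` drops the rest, so the
    # reconstruction error is the squared norm of the dropped columns' sum.
    dropped = [j for j in range(C.shape[1]) if j not in keep]
    return float(np.linalg.norm(C[:, dropped].sum(axis=1)) ** 2)


def astar_prune(C, k, h=lambda C, S, k: 0.0):
    # Best-first A* over partial expert subsets. `h(C, S, k)` must be an
    # admissible lower bound on the error of the best completion of the
    # partial subset S to exactly k experts; the first complete subset
    # popped from the frontier is then provably optimal.
    m = C.shape[1]
    frontier = [(h(C, (), k), ())]          # entries: (priority, subset tuple)
    while frontier:
        f, S = heapq.heappop(frontier)
        if len(S) == k:                     # complete subset popped first => optimal
            return S, subset_error(C, S)
        start = S[-1] + 1 if S else 0       # canonical ordering avoids duplicate subsets
        for j in range(start, m):
            child = S + (j,)
            priority = (subset_error(C, child) if len(child) == k
                        else h(C, child, k))
            heapq.heappush(frontier, (priority, child))
    return None


# Toy usage: 8 experts, keep 3 (random stand-in for calibration statistics).
rng = np.random.default_rng(0)
C = rng.normal(size=(32, 8))
print(astar_prune(C, k=3))

With an informative admissible h, any partial subset whose bound already exceeds the error of the best complete subset found so far is never expanded, which is what makes optimal search tractable; with the placeholder h ≡ 0 the sketch degenerates to exhaustive enumeration.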
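Finally, for allocating budgets across heterogeneous layers, one standard knee-detection recipe (maximum perpendicular distance from the chord joining the curve's endpoints, in the spirit of Kneedle) is sketched below; whether the paper uses exactly this rule, and the names knee_point, budgets, and fidelities, are assumptions for illustration.

import numpy as np


def knee_point(budgets, fidelities):
    # Knee of a fidelity-vs-budget curve: the point farthest (in
    # perpendicular distance) from the straight chord connecting the
    # first and last points of the curve.
    x = np.asarray(budgets, dtype=float)
    y = np.asarray(fidelities, dtype=float)
    p0, p1 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    d = (p1 - p0) / np.linalg.norm(p1 - p0)     # unit chord direction
    pts = np.stack([x, y], axis=1) - p0
    dist = np.abs(pts[:, 0] * d[1] - pts[:, 1] * d[0])   # cross-product distance
    return int(np.argmax(dist))


# Toy usage: fidelity saturates as the budget grows; the knee marks
# where additional experts stop paying for their memory footprint.
budgets = [2, 4, 6, 8, 10, 12]
fidelities = [0.62, 0.85, 0.94, 0.97, 0.98, 0.985]
print(knee_point(budgets, fidelities))   # index of the knee budget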