Breaking the Echo Chamber: A Dynamic Ensemble Pruning Perspective on MoE
Abstract
We introduce Mahalanobis-Pruned Mixture-of-Experts (MP-MoE), a novel routing framework that approaches expert selection from the perspective of ensemble pruning. Existing Mixture-of-Experts (MoE) routing strategies often suffer from representation collapse due to greedy top-k selection mechanisms, or rely on complex auxiliary regularization terms that may compromise model performance. To address these issues, we formulate routing as a diversity-aware subset selection problem and optimize a Mahalanobis-distance-based objective that explicitly promotes expert diversity. Specifically, we demonstrate that the expert co-occurrence matrix effectively captures inter-expert correlations, allowing us to efficiently model the covariance structure required for the distance computation without accessing expert parameters. Furthermore, we devise a greedy strategy for the routing mechanism, backed by theoretical approximation guarantees, rendering it a plug-and-play module with negligible overhead. MP-MoE increases wall-clock training time by approximately 3\%, while incurring no additional latency at inference time. Extensive experiments on large language model pre-training demonstrate that our method consistently outperforms the baseline by 1-3 percentage points across a broad range of benchmarks.
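To make the routing idea concrete, the following is a minimal sketch of a greedy, diversity-aware expert selection step using a Mahalanobis-style score derived from an expert co-occurrence matrix. The abstract does not specify the exact objective, so the scoring rule and all names here (greedy_select_experts, gate_logits, cooccurrence) are illustrative assumptions, not the paper's actual implementation.

\begin{verbatim}
# Hypothetical sketch: greedy diversity-aware expert selection.
# The Mahalanobis-style score below is an assumption for illustration,
# not the exact MP-MoE objective.
import numpy as np

def greedy_select_experts(gate_logits, cooccurrence, k, eps=1e-4):
    """Greedily pick k experts, trading off gate affinity against
    redundancy measured via the expert co-occurrence matrix."""
    n = gate_logits.shape[0]
    # Treat the co-occurrence matrix as a covariance proxy; regularize it.
    cov = cooccurrence + eps * np.eye(n)
    precision = np.linalg.inv(cov)

    selected, remaining = [], set(range(n))
    for _ in range(k):
        best, best_score = None, -np.inf
        for e in remaining:
            cand = selected + [e]
            v = gate_logits[cand]
            sub_prec = precision[np.ix_(cand, cand)]
            # Score is large when the candidate set has strong gate
            # affinity along directions the co-occurrence structure
            # treats as mutually non-redundant.
            score = float(v @ sub_prec @ v)
            if score > best_score:
                best, best_score = e, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: 8 experts, route each token to 2 of them.
rng = np.random.default_rng(0)
logits = rng.normal(size=8)
co = rng.random((8, 8))
co = (co + co.T) / 2 + 8 * np.eye(8)   # symmetric, well-conditioned
print(greedy_select_experts(logits, co, k=2))
\end{verbatim}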