ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning
Abstract
Large Language Models (LLMs) demonstrate remarkable capabilities but face deployment challenges due to their high computational demands. Traditional pruning methods reduce these costs by permanently removing parameters, which inevitably leads to performance degradation. To mitigate this issue, we propose ToMoE, a method that transforms dense LLMs into Mixture-of-Experts (MoE) models by uncovering experts inherently present within dense models, without requiring any weight updates. ToMoE leverages dynamic structural pruning to unify expert construction and router training in a single stage, achieving consistently strong performance. Remarkably, even without fine-tuning \revise{the model weights}, ToMoE consistently outperforms state-of-the-art pruning and MoE techniques across Phi-2, LLaMA-2, LLaMA-3, and Qwen-2.5 models. The code for this paper is available at https://github.com/gaosh/ToMoE.