ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference
Abstract
Fine-grained Mixture-of-Experts (MoE) models sparsely activate a subset of parameters, significantly reducing computational costs while maintaining performance. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. By introducing a temporal inductive bias, ReMoE encourages the model to consistently select the same experts over time, aligning its routing behavior with cache locality constraints and reducing the need to fetch experts from storage, without adding any extra computation during inference. Experiments on DeepSeek and Qwen models show that ReMoE improves the expert reuse rate by 26\%. Under a simulated standard LRU caching policy, ReMoE improves the cache hit rate by 15.7\%, corresponding to a 7.8\% reduction in median latency and an 8.5\% increase in proxy throughput, while maintaining downstream task performance.