EasyBalance: Cross-Layer Load Balancing in Distributed MoE Inference
Abstract
Load balancing has emerged as a critical problem in expert-parallel distributed inference of Mixture-of-Experts (MoE) models. Because routing distributions are typically skewed across experts, devices hosting lightly loaded experts must sit idle waiting for the most heavily loaded expert during expert computation, leading to inefficiency. Existing load-balancing approaches primarily rely on expert replication or migration within each layer, which introduces additional overhead and limits their flexibility and scalability. To address this problem, we propose EasyBalance, a \textbf{cross-layer} load-balancing strategy for expert-parallel MoE inference. EasyBalance requires no modification to the expert-device mapping, enabling instant adaptability and incurring essentially no additional overhead. Our key insights are that (1) experts in other layers can be viewed as naturally redundant, and (2) expert workloads from multiple layers can be executed jointly as long as they belong to different micro-batches. Based on these observations, EasyBalance greedily schedules subsets of cross-layer workloads at each expert-computation stage, deferring the remaining workloads to future balancing opportunities. Extensive experiments across different models, tasks, and parallelism configurations demonstrate that EasyBalance consistently accelerates expert-parallel inference, uniformly reducing GPU idle time by over 40\%.
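To make the cross-layer scheduling idea concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: the function name, its signature, the per-device token capacity, and the largest-first heuristic are all assumptions introduced here. It only shows how pending per-expert workloads accumulated across layers and micro-batches might be greedily admitted or deferred at one expert-computation stage, without ever touching the expert-device mapping.

\begin{verbatim}
from collections import defaultdict

def greedy_cross_layer_schedule(pending, expert_to_device, capacity):
    """Hypothetical greedy scheduler sketch.

    pending:          list of (layer, micro_batch, expert_id, num_tokens)
                      work items accumulated across layers/micro-batches.
    expert_to_device: static expert -> device mapping (never modified).
    capacity:         per-device token budget for this stage (assumed).

    Returns (scheduled, deferred): items to run now versus items kept
    for a later stage, chosen so per-device loads stay balanced.
    """
    device_load = defaultdict(int)
    scheduled, deferred = [], []

    # Largest items first: placing big workloads early makes it easier
    # to even out device loads with the smaller ones that follow.
    for item in sorted(pending, key=lambda it: -it[3]):
        layer, micro_batch, expert_id, num_tokens = item
        device = expert_to_device[expert_id]
        if device_load[device] + num_tokens <= capacity:
            device_load[device] += num_tokens
            scheduled.append(item)
        else:
            # Defer to a future stage, where workloads from other layers
            # and micro-batches may fill gaps on lightly loaded devices.
            deferred.append(item)
    return scheduled, deferred

if __name__ == "__main__":
    # Toy example: 4 experts on 2 devices, skewed routing.
    expert_to_device = {0: 0, 1: 0, 2: 1, 3: 1}
    pending = [
        (5, 0, 0, 900),  # hot expert on device 0
        (5, 0, 2, 100),
        (6, 1, 1, 300),  # next layer, different micro-batch
        (6, 1, 3, 500),
    ]
    run_now, later = greedy_cross_layer_schedule(
        pending, expert_to_device, capacity=1000)
    print("run now:", run_now)
    print("deferred:", later)
\end{verbatim}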