$\phi$-Balancing for Mixture-of-Experts Training
Lizhang Chen ⋅ Jonathan Li ⋅ Qi Wang ⋅ Runlong Liao ⋅ Shuozhe Li ⋅ Chen Liang ⋅ Ni Lao ⋅ Qiang Liu
Abstract
Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a Schur-convex potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.
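To make the EMA-based routing adjustment concrete, below is a minimal PyTorch sketch, not the paper's exact algorithm. It assumes a specific Schur-convex potential, $\phi(p) = \tfrac{1}{2}\lVert p - \mathbf{1}/E \rVert^2$ (squared distance of the routing distribution to uniform over $E$ experts), maintains an EMA estimate of the expected expert load, and takes a mirror-descent-style step on a per-expert bias that shifts router logits only for top-k selection. All names (`PhiBalancedRouter`, `ema_decay`, `lr`) are illustrative, not from the paper.

```python
import torch

class PhiBalancedRouter(torch.nn.Module):
    """Hypothetical sketch of phi-balancing's EMA-based routing adjustment.

    Assumptions (not from the paper): phi(p) = 0.5 * ||p - uniform||^2,
    and the dual bias is updated by a gradient step on the EMA load.
    """

    def __init__(self, d_model, n_experts, k=2, ema_decay=0.99, lr=0.01):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k
        self.ema_decay = ema_decay
        self.lr = lr
        # EMA estimate of the population routing distribution p.
        self.register_buffer("p_ema", torch.full((n_experts,), 1.0 / n_experts))
        # Per-expert bias acting as the dual (min-max) variable.
        self.register_buffer("bias", torch.zeros(n_experts))

    def forward(self, x):
        # x: [tokens, d_model]
        logits = self.gate(x)  # [tokens, n_experts]
        # Bias influences expert selection only, not the gating weights,
        # mirroring loss-free balancing.
        topk = torch.topk(logits + self.bias, self.k, dim=-1).indices
        if self.training:
            with torch.no_grad():
                # Empirical load of this mini-batch.
                load = torch.zeros_like(self.p_ema)
                load.scatter_add_(
                    0, topk.flatten(),
                    torch.ones(topk.numel(), device=x.device))
                load /= load.sum()
                # EMA tracks the population-level routing distribution.
                self.p_ema.mul_(self.ema_decay).add_(
                    load, alpha=1 - self.ema_decay)
                # Descent step on the bias: for the assumed phi,
                # grad phi(p) = p - uniform, so overloaded experts
                # (p_ema > 1/E) get their logits pushed down.
                uniform = 1.0 / self.p_ema.numel()
                self.bias.add_(self.p_ema - uniform, alpha=-self.lr)
        return logits, topk
```

The key design choice this sketch illustrates is that the balancing signal enters through a logit bias on expert selection rather than through an auxiliary loss, so no extra gradient term perturbs the language-modeling objective; the EMA makes the update target a population-level estimate rather than noisy per-mini-batch statistics.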