MuLoCo: Muon is a Practical Inner Optimizer for DiLoCo
Benjamin Thérien ⋅ Xiaolong Huang ⋅ Aaron Defazio ⋅ Irina Rish ⋅ Eugene Belilovsky
Abstract
DiLoCo is a powerful framework for training large language models (LLMs) under networking constraints, allowing for increased parallelism and accelerator utilization in data center settings. A critical but often overlooked factor in DiLoCo’s behavior is the choice of inner optimizer, which shapes the pseudogradient used by the outer optimizer. Given Muon's recent success relative to AdamW in data-parallel (DP) training, in this work we examine how Muon's normalized optimizer steps affect the quality of the pseudogradient. Empirically, we find that, relative to AdamW, Muon yields more \emph{directionally correct} pseudogradients as the number of workers ($K$) is increased. In our language-model pre-training experiments, we conduct extensive hyperparameter tuning across 150M, 416M, 914M, 1.76B, and 3.1B parameter models for DiLoCo, MuLoCo, AdamW DP, and Muon DP. Consistently across all scales, we find that with $K\geq1$ workers, MuLoCo (DiLoCo with a Muon inner optimizer) achieves superior performance to DiLoCo in absolute terms, and for $K>2$, it outperforms DiLoCo relative to their respective data-parallel baselines, while remaining compatible with quantization, streaming, and long synchronization intervals. At $K=1$, we find that MuLoCo can even outperform the data-parallel gold standard while having larger optimal and critical batch sizes.
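To make the abstract's setup concrete, the following is a minimal, hedged NumPy sketch of one DiLoCo communication round with a Muon-style inner optimizer. It is illustrative only, not the paper's implementation: the function names, hyperparameter values, and the use of an SVD-based polar factor as a stand-in for Muon's Newton–Schulz orthogonalization are all our assumptions; each worker also reuses a single fixed gradient across its inner steps for simplicity.

```python
import numpy as np

def muon_like_step(W, M, grad, lr=0.02, beta=0.95):
    """One Muon-style inner step on a weight matrix (sketch):
    momentum accumulation followed by orthogonalization of the
    update direction. Muon uses a Newton-Schulz iteration; here
    an SVD-based polar factor stands in for it."""
    M = beta * M + grad                        # momentum buffer
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    W = W - lr * (U @ Vt)                      # orthogonalized update
    return W, M

def diloco_round(shared_W, worker_grads, inner_steps=4,
                 outer_lr=0.7, outer_beta=0.9, outer_M=None):
    """One DiLoCo round (illustrative): each of K workers runs
    `inner_steps` local updates from the shared parameters; the
    pseudogradient is shared_W minus the averaged local parameters,
    applied by an outer SGD-with-momentum step."""
    local_params = []
    for grad in worker_grads:                  # one entry per worker
        W, M = shared_W.copy(), np.zeros_like(shared_W)
        for _ in range(inner_steps):
            W, M = muon_like_step(W, M, grad)
        local_params.append(W)
    # Pseudogradient: displacement from shared params to worker average.
    pseudograd = shared_W - np.mean(local_params, axis=0)
    if outer_M is None:
        outer_M = np.zeros_like(shared_W)
    outer_M = outer_beta * outer_M + pseudograd
    shared_W = shared_W - outer_lr * outer_M   # outer optimizer step
    return shared_W, pseudograd, outer_M
```

Swapping `muon_like_step` for an AdamW step (while keeping the outer loop fixed) is the comparison the abstract describes: the inner optimizer determines the shape of `pseudograd` that the outer optimizer consumes.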