Understanding MARS: When Scaling Momentum Provably Helps
Egor Shulgin ⋅ Tamaz Gadaev ⋅ Sarit Khirirat ⋅ Peter Richtarik
Abstract
MARS has recently emerged as a state-of-the-art optimizer, consistently outperforming AdamW in large language model (LLM) training. It modifies the momentum-based variance reduction (MVR) update by introducing a multiplicative coefficient $\gamma$ that scales the momentum correction term. However, the existing theory of Yuan et al. (2025) does not explain why this modification improves the convergence of MARS over MVR. In this paper, we provide a rigorous theoretical explanation for the superiority of MARS over MVR. We introduce a novel similarity condition, **$\gamma$-similarity**, which generalizes the standard similarity and smoothness assumptions used to analyze stochastic algorithms. Under this condition, we derive gradient complexity guarantees for MARS that depend explicitly on $\gamma$ and on a $\gamma$-similarity constant $\delta_\gamma$. We prove that, by appropriately tuning $\gamma \in [0,1]$, MARS achieves strictly lower complexity than MVR. Finally, experiments on GPT pretraining corroborate our theoretical findings, demonstrating that MARS with an optimal choice of $\gamma$ improves token efficiency over MVR and yields substantial gains over AdamW.
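For concreteness, the modification in question can be sketched as follows. The notation is assumed here rather than quoted from the paper: $g_t = \nabla f(x_t, \xi_t)$ is the stochastic gradient at the current iterate, $\tilde g_{t-1} = \nabla f(x_{t-1}, \xi_t)$ is the gradient at the previous iterate evaluated on the same sample, and $\beta \in [0,1)$ is the momentum parameter; the normalization and preconditioning steps of the full MARS recipe are omitted.

$$
c_t = g_t + \gamma\,\frac{\beta}{1-\beta}\,\bigl(g_t - \tilde g_{t-1}\bigr),
\qquad
m_t = \beta\, m_{t-1} + (1-\beta)\, c_t .
$$

Under this sketch, $\gamma = 1$ recovers the MVR (STORM-style) gradient estimator, while $\gamma = 0$ reduces to plain exponential-moving-average momentum; the analysis concerns the intermediate regime $\gamma \in [0,1]$.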