Delving into Muon and Beyond: Deep Analysis and Extensions
Abstract
The Muon optimizer has recently attracted considerable attention for its strong empirical performance and its use of orthogonalized updates on matrix-shaped parameters, yet its underlying mechanisms and its relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we address these questions through a unified spectral perspective. Specifically, we view Muon as the $p = 0$ endpoint of a family of spectral transformations of the form $\boldsymbol{U} \boldsymbol{\Sigma}^{p} \boldsymbol{V}^{\top}$, and consider additional variants with $p = \frac{1}{2}$, $p = \frac{1}{4}$, and $p = 1$. These transformations are applied both to first-moment updates, as in momentum SGD, and to root-mean-square (RMS) normalized gradient updates, as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS-normalized updates yield more stable optimization than first-moment updates. Moreover, while spectral compression provides strong stabilization benefits under first-moment updates, the Muon update ($p = 0$) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not as a universally superior optimization method.
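For concreteness, the spectral family described in the abstract can be sketched with an explicit SVD. This is only a reference definition, assuming that U, Σ, and V are the thin SVD factors of the update matrix. It is not the coupled Newton iteration developed in this work, which is designed to avoid this decomposition. The function name `spectral_power` and the random test matrix are hypothetical and chosen for illustration.

```python
import numpy as np


def spectral_power(M, p):
    """Return U @ diag(s**p) @ Vt, where M = U @ diag(s) @ Vt is the thin SVD.

    Illustrative reference only (explicit SVD); assumes M has full rank,
    so that p = 0 maps every singular value to exactly 1.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Scale the columns of U by s**p, then recombine with Vt.
    return (U * s**p) @ Vt


# Hypothetical 4x3 "update" matrix standing in for a gradient or momentum buffer.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))

for p in (0, 0.25, 0.5, 1):
    T = spectral_power(G, p)
    # The singular values of T are those of G raised to the power p.
    print(p, np.linalg.svd(T, compute_uv=False))
```

Under these assumptions, p = 1 returns the matrix unchanged, p = 0 returns the orthogonalized factor with all singular values equal to one (the Muon-style update), and the intermediate exponents compress the spectrum without flattening it completely.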