RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
Shenyang Deng ⋅ Zhuoli Ouyang ⋅ Ruochen Jin ⋅ Tianyu Pang ⋅ Zihang Liu ⋅ Shuhua Yu ⋅ Yaoqing Yang
Abstract
Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information. The central challenge in this field lies in balancing preconditioning effectiveness against the computational cost of applying the preconditioner. Among recent advances, \textsc{Muon} stands out by using the Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. In this paper, we introduce \textsc{RMNP} (Row-Momentum Normalized Preconditioning), an optimizer that replaces the Newton-Schulz iteration with a simple row-wise $\ell_2$ normalization, motivated by the empirically observed block-diagonal structure of the Transformer layerwise Hessian. This substitution reduces the per-iteration complexity from $\mathcal{O}(mn\cdot\min(m,n))$ to $\mathcal{O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for \textsc{RMNP} in the non-convex setting that match recent results for \textsc{Muon}, achieving the information-theoretic minimax optimal complexity. Extensive experiments on large language model pretraining show that \textsc{RMNP} delivers optimization performance competitive with \textsc{Muon} while substantially reducing the wall-clock time of the preconditioning step. Our code is available at \href{https://anonymous.4open.science/r/RMNP-317C/}{link}.
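The abstract specifies only the core substitution: a row-wise $\ell_2$ normalization of a momentum matrix in place of Newton-Schulz orthogonalization. The sketch below illustrates what such an update might look like for a single 2-D weight matrix; the function name `rmnp_step`, the heavy-ball momentum form, and the hyperparameter defaults are illustrative assumptions rather than the paper's actual algorithm.

```python
import torch

def rmnp_step(param, grad, momentum, lr=0.02, beta=0.95, eps=1e-8):
    """One hypothetical RMNP update for an m x n weight matrix (sketch only)."""
    # Heavy-ball momentum accumulation (exact form assumed, not given in the abstract).
    momentum.mul_(beta).add_(grad)
    # Row-wise l2 normalization of the momentum stands in for Muon's
    # Newton-Schulz orthogonalization. Computing the m row norms and
    # dividing costs O(mn), versus O(mn * min(m, n)) per Newton-Schulz pass.
    row_norms = momentum.norm(dim=1, keepdim=True)
    param.add_(momentum / (row_norms + eps), alpha=-lr)

# Minimal usage with stand-in tensors:
W = torch.randn(4, 8)        # weight matrix
G = torch.randn_like(W)      # stand-in for a gradient
M = torch.zeros_like(W)      # momentum buffer, carried across steps
rmnp_step(W, G, M)
```

Because each row is rescaled independently, the operation is trivially parallel and avoids the matrix-matrix products that dominate the cost of Newton-Schulz; this is the source of the complexity reduction claimed above.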