Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning
Huan Li ⋅ Yiming Dong ⋅ Zhouchen Lin
Abstract
This paper studies the AdamW-style Shampoo optimizer, an effective implementation of the classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E[|||\nabla f(X_k)|||]\leq O(\frac{\sqrt{m+n}\,C}{K^{1/4}})$ measured in the nuclear norm (written $|||\cdot|||$ so that it displays correctly in OpenReview), where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameter, and $C$ matches the constant in the optimal convergence rate of SGD. In general, $||\nabla f(X)||\leq|||\nabla f(X)|||\leq\sqrt{\min(m,n)}\,||\nabla f(X)||$, where $||\cdot||$ denotes the Frobenius norm (again written this way to display correctly in OpenReview). Hence our rate is analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E[||\nabla f(X_k)||]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD in the ideal case where $|||\nabla f(X)|||=\Theta(\sqrt{\min(m,n)})\,||\nabla f(X)||$ and $m$ and $n$ are balanced.
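To make the setting concrete, below is a minimal sketch of a two-sided Shampoo-style preconditioned update with AdamW-style decoupled weight decay for a matrix parameter $X\in\mathbb{R}^{m\times n}$ (the one-sided variant preconditions on a single factor only). It assumes the standard Shampoo and AdamW recipes; the function names, hyperparameters, and toy objective are illustrative assumptions, not the paper's exact AlgoPerf implementation.

```python
# A minimal sketch (not the paper's exact algorithm) of an AdamW-style Shampoo
# step with two-sided preconditioning for a matrix parameter X in R^{m x n}.
# The accumulators L, R, the -1/4 exponents, and the decoupled weight decay
# follow the standard Shampoo/AdamW recipes; all hyperparameters are illustrative.
import numpy as np

def matrix_power(S, p, eps=1e-6):
    """Symmetric PSD matrix power S^p via eigendecomposition, with eps damping."""
    w, V = np.linalg.eigh(S)
    w = np.maximum(w, 0.0) + eps
    return (V * w**p) @ V.T

def shampoo_adamw_step(X, G, L, R, lr=0.05, beta=0.95, weight_decay=1e-2, two_sided=True):
    """One update of a Shampoo-style preconditioner with AdamW-style weight decay.

    X: (m, n) parameter, G: (m, n) gradient,
    L: (m, m) left accumulator, R: (n, n) right accumulator.
    Returns updated (X, L, R).
    """
    # Exponential moving averages of the Kronecker-factored second moments.
    L = beta * L + (1.0 - beta) * (G @ G.T)
    R = beta * R + (1.0 - beta) * (G.T @ G)
    if two_sided:
        # Two-sided preconditioning: L^{-1/4} G R^{-1/4}.
        update = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    else:
        # One-sided preconditioning: use only the left factor, L^{-1/2} G.
        update = matrix_power(L, -0.5) @ G
    # AdamW-style decoupled weight decay applied directly to the parameter.
    X = X - lr * (update + weight_decay * X)
    return X, L, R

# Usage on a toy quadratic objective f(X) = 0.5 * ||X - A||_F^2.
m, n = 8, 5
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
X = np.zeros((m, n))
L, R = np.zeros((m, m)), np.zeros((n, n))
for _ in range(200):
    G = X - A                      # gradient of the toy objective
    X, L, R = shampoo_adamw_step(X, G, L, R)
print("final gradient Frobenius norm:", np.linalg.norm(X - A))
```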