Poster
Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees
Thien Nguyen · Huy Nguyen
West Exhibition Hall B2-B3 #W-513
Abstract:
We introduce two complementary techniques for efficient optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) through step-size sharing. Subset-Norm (SN) reduces AdaGrad's memory footprint from $O(d)$ to $O(\sqrt{d})$, where $d$ is the model size. For non-convex smooth objectives under coordinate-wise sub-gaussian noise, we show a noise-adapted high-probability convergence guarantee with improved dimensional dependence of SN over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state's memory footprint by restricting momentum to a low-dimensional subspace while performing SGD in the orthogonal complement. We prove high-probability convergence rates for Subspace-Momentum under standard assumptions. Empirical evaluation on pre-training and fine-tuning LLMs demonstrates the effectiveness of our methods. For instance, combining Subset-Norm with Subspace-Momentum achieves Adam's validation perplexity for LLaMA 1B in approximately half the training tokens (6.8B vs 13.1B) while reducing Adam's optimizer-states memory footprint by more than 80\% with minimal additional hyperparameter tuning.
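To make the two ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation: subset_norm_step shares one AdaGrad-style accumulator across each subset of roughly $\sqrt{d}$ coordinates (recovering AdaGrad-Coordinate when subsets have size 1 and AdaGrad-Norm when a single subset covers all coordinates), and subspace_momentum_step keeps momentum only in a $k$-dimensional subspace with orthonormal basis P while applying plain SGD in the orthogonal complement. Function names, shapes, and hyperparameter values are illustrative assumptions, not values from the paper.

import numpy as np

def subset_norm_step(param, grad, accum, lr=1e-2, eps=1e-8, subset_size=None):
    """AdaGrad-style update with one shared squared-gradient accumulator per subset.

    Illustrative sketch: subset_size=1 gives coordinate-wise AdaGrad; a single
    subset of size d gives AdaGrad-Norm; size ~sqrt(d) keeps O(sqrt(d)) state.
    """
    d = param.size
    if subset_size is None:
        subset_size = int(np.ceil(np.sqrt(d)))          # ~sqrt(d) coords per subset
    g = grad.reshape(-1)
    n_subsets = int(np.ceil(d / subset_size))
    pad = n_subsets * subset_size - d
    g_pad = np.pad(g, (0, pad)).reshape(n_subsets, subset_size)
    accum += (g_pad ** 2).sum(axis=1)                   # one scalar per subset
    scale = 1.0 / (np.sqrt(accum) + eps)                # step size shared within a subset
    step = (g_pad * scale[:, None]).reshape(-1)[:d]
    param -= lr * step.reshape(param.shape)
    return param, accum

def subspace_momentum_step(param, grad, m, P, lr=1e-2, beta=0.9):
    """Momentum kept only in the k-dim subspace spanned by the orthonormal
    columns of P (d x k); plain SGD in the orthogonal complement."""
    g = grad.reshape(-1)
    g_sub = P.T @ g                                     # k-dim projected gradient
    m = beta * m + (1 - beta) * g_sub                   # momentum state is only k numbers
    g_orth = g - P @ g_sub                              # component outside the subspace: SGD
    param -= lr * (P @ m + g_orth).reshape(param.shape)
    return param, m

# Toy usage with hypothetical sizes (stand-in gradient, random orthonormal basis):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 4096, 64
    w = rng.standard_normal(d)
    subset_size = int(np.ceil(np.sqrt(d)))
    accum = np.zeros(int(np.ceil(d / subset_size)))
    P, _ = np.linalg.qr(rng.standard_normal((d, k)))    # orthonormal subspace basis
    m = np.zeros(k)
    g = rng.standard_normal(d)                          # stand-in for a stochastic gradient
    w, accum = subset_norm_step(w, g, accum)
    w, m = subspace_momentum_step(w, g, m, P)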
Lay Summary:
Optimizers for training deep neural networks, such as Adam, maintain internal states that consume a substantial amount of memory for large models. We introduce novel algorithms that not only reduce memory consumption but also achieve faster convergence than Adam and other recent memory-efficient optimizers such as GaLore, both experimentally and theoretically.