Contributed Talk & Poster in Workshop: 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)
AdaMeM: Memory Efficient Momentum for Adafactor
Nikhil Vyas · Depen Morwani · Sham Kakade
Adafactor is a memory-efficient algorithm that does not maintain momentum and has near-zero memory overhead compared to gradient descent. However, it performs worse than Adam in many setups. Prior works have shown that this gap can be closed by adding momentum to Adafactor, but this comes at the cost of increased memory requirements. In this work, we use ideas from low-rank optimizers such as LoRA and GaLore to maintain momentum on a low-rank subspace of the weights on top of Adafactor, giving a new optimizer: AdaMeM. Unlike low-rank optimizers, however, we still utilize full-rank gradients but maintain momentum only on the top SVD subspace of the gradients. We show results on language modelling for models with 210M and 550M parameters, demonstrating improved performance over Adafactor and GaLore. We also give theoretical arguments supporting the design of AdaMeM.
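Below is a minimal, illustrative sketch of the idea as described in the abstract only: full-rank gradients feed an Adafactor-style factored second moment, while momentum is kept solely on the top-k SVD subspace of the gradient. All names and hyperparameters (`AdaMeMSketch`, `rank`, `update_proj_every`, the subspace refresh schedule, and how the low-rank and residual components are combined) are assumptions made for illustration; this is not the authors' implementation.

```python
import torch


class AdaMeMSketch:
    """Illustrative sketch (not the paper's algorithm): Adafactor-style
    factored second moment on full-rank gradients, momentum only on a
    low-rank subspace spanned by the top singular directions of the gradient."""

    def __init__(self, param, lr=1e-3, beta1=0.9, beta2=0.999,
                 rank=8, update_proj_every=200, eps=1e-30):
        assert param.ndim == 2, "sketch handles a single 2-D weight matrix"
        self.p = param
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.rank, self.update_proj_every = rank, update_proj_every
        n, m = param.shape
        # Adafactor-style factored second moment: one row and one column
        # accumulator instead of a full n x m matrix.
        self.v_row = torch.zeros(n)
        self.v_col = torch.zeros(m)
        # Low-rank momentum lives in a (rank x m) buffer, rank << n.
        self.proj = None                      # (rank, n) top-SVD projection
        self.m_low = torch.zeros(rank, m)     # momentum in the projected space
        self.step = 0

    def _maybe_refresh_projection(self, grad):
        # Periodically recompute the top-r left singular vectors of the
        # current gradient to define the momentum subspace (hypothetical schedule).
        if self.proj is None or self.step % self.update_proj_every == 0:
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            self.proj = U[:, : self.rank].T.contiguous()

    @torch.no_grad()
    def update(self, grad):
        self.step += 1
        self._maybe_refresh_projection(grad)

        # Factored second moment on the *full-rank* gradient (Adafactor-style).
        sq = grad.pow(2) + self.eps
        self.v_row.mul_(self.beta2).add_(sq.mean(dim=1), alpha=1 - self.beta2)
        self.v_col.mul_(self.beta2).add_(sq.mean(dim=0), alpha=1 - self.beta2)
        v_hat = torch.outer(self.v_row, self.v_col) / self.v_row.mean().clamp_min(self.eps)
        precond_grad = grad / v_hat.sqrt().clamp_min(self.eps)

        # Momentum is maintained only in the low-rank subspace of the gradient.
        self.m_low.mul_(self.beta1).add_(self.proj @ precond_grad, alpha=1 - self.beta1)

        # Combine the lifted low-rank momentum with the momentum-free residual
        # component of the preconditioned gradient (one possible combination rule).
        low_rank_update = self.proj.T @ self.m_low
        residual = precond_grad - self.proj.T @ (self.proj @ precond_grad)
        self.p.add_(low_rank_update + residual, alpha=-self.lr)
```

Under these assumptions, the extra state beyond Adafactor is only the `(rank, n)` projection and the `(rank, m)` momentum buffer, which is small relative to the `n x m` momentum that full Adam-style updates would require.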