IO-Adam: Rethinking Memory-Efficient Adaptive Optimizers from Gradient Computation
Yiting Chen ⋅ Zongwei Huo ⋅ Junchi Yan
Abstract
Adaptive Moment Estimation (Adam) is one of the most popular and often the default stochastic optimizers for deep neural network training. Using first- and second-moment estimation, Adam provides adaptive learning rates for each parameter, significantly outperforming Stochastic Gradient Descent (SGD). However, as deep neural networks grow larger, estimating the first and second moments consumes substantial memory, which motivates various methods to reduce the memory usage of adaptive optimizers. In this paper, we rethink first- and second-moment estimation from a gradient computation perspective. The gradient of a weight matrix is the product of the layer input and the gradient of the layer output. Instead of finding low-rank approximations of the first and second moments, as in previous work, we propose tracking the input and the output gradient to estimate the moments efficiently. We analyze the similarities and differences between our proposed method, the widely used Adam optimizer, and previous memory-efficient optimizers. Experiments verify the effectiveness of our method, which reduces memory usage by up to $30\%$ while matching or even improving the performance of Adam.
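To make the gradient-factorization observation concrete, the following is a minimal NumPy sketch of the structure the abstract refers to: for a linear layer, the weight gradient factors into the layer input and the output gradient, so per-side statistics are much smaller than the full moment matrices Adam stores. The rank-one moment surrogate below is only an illustrative construction under this assumption, not the paper's actual IO-Adam update rule.

```python
import numpy as np

# For a linear layer y = x @ W (x: [batch, d_in], W: [d_in, d_out]),
# backpropagation gives grad_W = x.T @ grad_y: the weight gradient is the
# product of the layer input and the gradient of the layer output.
rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 256, 128
x = rng.standard_normal((batch, d_in))        # layer input
grad_y = rng.standard_normal((batch, d_out))  # gradient of the layer output

grad_W = x.T @ grad_y  # full [d_in, d_out] gradient, as consumed by Adam

# Adam keeps two full [d_in, d_out] buffers (first and second moments).
# The abstract's idea is to track the input and output-gradient sides instead.
# Illustrative per-side statistics (hypothetical, for memory comparison only):
input_sq = (x ** 2).mean(axis=0)           # [d_in]  input statistic
grad_out_sq = (grad_y ** 2).mean(axis=0)   # [d_out] output-gradient statistic

# A rank-one surrogate for the second moment, factored as an outer product.
second_moment_approx = np.outer(input_sq, grad_out_sq)  # [d_in, d_out]

# Memory: two vectors (d_in + d_out entries) versus two full matrices
# (2 * d_in * d_out entries) for Adam's moment buffers.
print("buffer entries:", d_in + d_out, "vs", 2 * d_in * d_out)
```

The point of the sketch is only the memory accounting: the factored statistics scale with d_in + d_out, whereas Adam's moment buffers scale with d_in * d_out.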