Poster in Workshop: High-dimensional Learning Dynamics Workshop: The Emergence of Structure and Reasoning
Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity
Shuo Xie · Mohamad Amin Mohamadi · Zhiyuan Li
Abstract:
Adam outperforms SGD when optimizing transformers for language modeling tasks, yet this benefit is not well understood theoretically -- previous convergence analyses for Adam and SGD focus mainly on the number of steps $T$ and are already minimax-optimal in non-convex settings, with both achieving an $O(T^{-1/4})$ rate. In this work, we argue that the key to Adam's faster optimization is its better dependence on the loss smoothness constant and the model dimension, the latter of which is typically much larger than the total number of steps for modern language modeling tasks. More specifically, we give a new convergence analysis for Adam under the novel assumption that the loss is smooth under $\ell_\infty$ geometry rather than the more common $\ell_2$ geometry, which yields a much better empirical smoothness constant for GPT-2 models. Moreover, we show that if the pretraining loss is randomly rotated, Adam can be outperformed by certain variants of SGD that are invariant to rotations. This implies that any practically relevant explanation of Adam's optimization benefit must involve properties of the loss that are not rotation-invariant, such as the $\ell_\infty$ smoothness used in our analysis.
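For reference, smoothness with respect to a general norm $\|\cdot\|$ is usually stated by bounding the change in the gradient in the dual norm $\|\cdot\|_*$; the dual of $\ell_2$ is $\ell_2$, while the dual of $\ell_\infty$ is $\ell_1$. The sketch below gives the standard definitions only; the paper's exact assumptions may differ.

```latex
% Smoothness of a loss L w.r.t. a norm, stated via the dual norm of the gradient gap.
% (Standard definitions; the paper's precise assumptions may use refined variants.)
\[
  \ell_2\text{-smoothness:}\qquad
  \|\nabla L(x) - \nabla L(y)\|_2 \le H_2\, \|x - y\|_2
  \quad \forall x, y,
\]
\[
  \ell_\infty\text{-smoothness:}\qquad
  \|\nabla L(x) - \nabla L(y)\|_1 \le H_\infty\, \|x - y\|_\infty
  \quad \forall x, y.
\]
```

Because the norm on parameter space and the dual norm on gradients change together, $H_2$ and $H_\infty$ can scale very differently with the model dimension on the same loss, which is the kind of geometry-dependent gap the abstract refers to. The rotation argument can likewise be pictured with a toy script (a hypothetical setup for illustration only, not the paper's experiment): gradient descent's function-space trajectory is unchanged by an orthogonal reparameterization of a quadratic loss, whereas Adam's coordinate-wise second-moment scaling is not, so its behavior can change once the loss is randomly rotated.

```python
import numpy as np

# Toy illustration (hypothetical setup, not the paper's experiment):
# the same quadratic loss is optimized by Adam in its natural, axis-aligned
# coordinates and again after a random orthogonal rotation of parameter space.
# Gradient descent's trajectory (in function space) is unaffected by such a
# rotation, but Adam's coordinate-wise scaling is not.

rng = np.random.default_rng(0)
d = 100
curvatures = np.logspace(0, 3, d)   # ill-conditioned diagonal Hessian spectrum

def adam_final_loss(Q, steps=2000, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Minimize f(w) = 0.5 * (Q w)^T diag(curvatures) (Q w) with Adam."""
    w = Q.T @ np.ones(d)            # same starting point in function space
    m = np.zeros(d)
    v = np.zeros(d)
    for t in range(1, steps + 1):
        g = Q.T @ (curvatures * (Q @ w))       # gradient of f at w
        m = b1 * m + (1 - b1) * g              # first-moment estimate
        v = b2 * v + (1 - b2) * g ** 2         # second-moment estimate
        m_hat = m / (1 - b1 ** t)              # bias corrections
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    z = Q @ w
    return 0.5 * z @ (curvatures * z)

identity = np.eye(d)
rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal Q

print("Adam final loss, axis-aligned:     ", adam_final_loss(identity))
print("Adam final loss, randomly rotated: ", adam_final_loss(rotation))
```

Comparing the two printed losses (or rotating an actual pretraining loss, as the abstract describes) probes whether an optimizer's advantage relies on axis-aligned structure of the loss.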