

Poster in Workshop: HiLD: High-dimensional Learning Dynamics Workshop

How Does Adaptive Optimization Impact Local Neural Network Geometry?

Kaiqi Jiang · Dhruv Malik · Yuanzhi Li


Abstract: Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint explains this improved performance by arguing that adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments on language models, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster optimization. By contrast, SGD (with momentum) biases the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network.
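The abstract does not spell out how $R^{\text{OPT}}_{\text{med}}$ is computed, so the following is only a minimal illustrative sketch, not the authors' code. It tracks a condition-number-like statistic of the loss Hessian along the iterates of Adam versus SGD with momentum for a tiny two-layer linear network (the simplified setting of the paper's theory). The specific proxy used here, the largest Hessian eigenvalue divided by the median absolute eigenvalue with the median taken over the trajectory, as well as all hyperparameters, are assumptions made for illustration.

```python
# Hedged sketch: a Hessian-based "median ratio" statistic along optimizer trajectories.
# The exact definition of R^OPT_med is not given in the abstract; the ratio below
# (largest eigenvalue / median absolute eigenvalue) is an assumed proxy.
import torch

torch.manual_seed(0)

# Tiny two-layer linear network y = W2 W1 x on synthetic regression data.
d_in, d_hid, d_out, n = 5, 4, 3, 64
X = torch.randn(n, d_in)
Y = torch.randn(n, d_out)
init = [0.1 * torch.randn(d_hid, d_in), 0.1 * torch.randn(d_out, d_hid)]

def loss_fn(w1, w2):
    return ((X @ w1.T @ w2.T - Y) ** 2).mean()

def hessian_ratio(w1, w2):
    # Full Hessian of the loss w.r.t. all parameters (only feasible for tiny models).
    flat = torch.cat([w1.reshape(-1), w2.reshape(-1)]).detach()

    def flat_loss(v):
        a = v[: w1.numel()].reshape(w1.shape)
        b = v[w1.numel():].reshape(w2.shape)
        return loss_fn(a, b)

    H = torch.autograd.functional.hessian(flat_loss, flat)
    eigs = torch.linalg.eigvalsh(H)  # ascending order
    # Condition-number-like ratio; median absolute eigenvalue is an assumed proxy
    # for the "typical" curvature scale at this iterate.
    return (eigs[-1] / eigs.abs().median().clamp_min(1e-8)).item()

def run(opt_name, steps=200, lr=1e-2):
    w1, w2 = [p.clone().requires_grad_(True) for p in init]
    opt = (torch.optim.Adam([w1, w2], lr=lr) if opt_name == "adam"
           else torch.optim.SGD([w1, w2], lr=lr, momentum=0.9))
    ratios = []
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(w1, w2).backward()
        opt.step()
        ratios.append(hessian_ratio(w1.detach(), w2.detach()))
    # Median of the per-iterate ratio over the trajectory.
    return torch.tensor(ratios).median().item()

print("median Hessian ratio along Adam trajectory:", run("adam"))
print("median Hessian ratio along SGD+momentum trajectory:", run("sgd"))
```

Under the paper's thesis, one would expect the Adam trajectory to report a smaller median ratio than the SGD-with-momentum trajectory, although in this toy sketch the outcome depends on the assumed initialization, learning rates, and choice of ratio.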
