

Poster

On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning

Thomas T. Zhang · Behrad Moniri · Ansh Nagwekar · Faraz Rahman · Anton Xue · Hamed Hassani · Nikolai Matni

West Exhibition Hall B2-B3 #W-1018
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract: Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, *linear representation learning* and *single-index learning*, which are widely used to study how typical algorithms efficiently learn useful *features* to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
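
As a rough illustration of the distinction the abstract draws, the sketch below contrasts entry-wise ("diagonal") preconditioning of a weight-matrix gradient with a Shampoo-style layer-wise preconditioner that applies one matrix per axis of the layer. This is a generic illustration, not the paper's algorithm or analysis; the accumulators, matrix powers, and hyperparameters here are placeholder choices.

```python
# Minimal sketch (not the paper's method): entry-wise vs. layer-wise
# preconditioning of the gradient of one weight matrix W in R^{m x n}.
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
G = rng.normal(size=(m, n))          # gradient of the layer's weight matrix
eps, lr = 1e-8, 1e-2

# Entry-wise ("diagonal") preconditioning: each parameter is rescaled
# independently by an estimate of its own squared gradient (one step shown).
v = G**2
step_diag = lr * G / (np.sqrt(v) + eps)

# Layer-wise preconditioning: one preconditioner per axis of the weight
# tensor, so dependencies between parameters within the layer are captured.
L = G @ G.T + eps * np.eye(m)        # row-axis (left) statistics
R = G.T @ G + eps * np.eye(n)        # column-axis (right) statistics

def mat_power(M, p):
    """Power of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.maximum(w, eps) ** p) @ Q.T

# Shampoo-style update: apply a fractional inverse power on each axis.
step_layer = lr * mat_power(L, -0.25) @ G @ mat_power(R, -0.25)

print("diagonal-preconditioned step:\n", step_diag)
print("layer-wise preconditioned step:\n", step_layer)
```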

Lay Summary:

Recently, a new family of optimization algorithms has shown great promise in making neural network training faster and more efficient in practice. These algorithms introduce new forms of "preconditioning", the practice of "re-sizing" a problem so that good solutions are easier to find. The current standard optimizer, Adam, performs preconditioning independently on each parameter of a neural network, while these new algorithms also take into account dependencies between parameters within each layer of the network, hence "layer-wise preconditioning".

On the other hand, theory researchers have proposed various problems to understand very clearly how neural networks can find good solutions. These works typically study the most basic optimization algorithm, stochastic gradient descent (SGD). However, we found that SGD is fundamentally limited: when the data is not perfectly "well-conditioned" (imagine some coordinates of the data being larger than others), these positive results about neural network training no longer hold, in theory or in practice.

In finding ways to adjust SGD to work for general types of data, we found that the resulting algorithm aligns with these practical "layer-wise preconditioning" algorithms. This has implications both for theorists, for whom these results provide a concrete path to analyzing larger families of neural network optimization algorithms, and for practitioners, for whom these results provide a strong mathematical motivation for why these new algorithms work.
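
To make the "well-conditioned" point concrete, here is a toy sketch (not the paper's experiments) in which plain SGD on a linear model with anisotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \boldsymbol{\Sigma})$ converges slowly along the low-variance direction, while preconditioning the gradient by $\boldsymbol{\Sigma}^{-1}$ removes the imbalance; the covariance, step size, and step count below are arbitrary illustrative choices.

```python
# Toy illustration (not from the paper): ill-conditioned inputs slow down SGD,
# and an input-side preconditioner (here, Sigma^{-1}) restores fast convergence.
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, lr = 2, 200, 0.05
Sigma = np.diag([10.0, 0.1])            # ill-conditioned input covariance
w_star = np.array([1.0, -1.0])          # ground-truth linear predictor

def run(precondition):
    w = np.zeros(d)
    for _ in range(n_steps):
        x = rng.multivariate_normal(np.zeros(d), Sigma, size=32)
        y = x @ w_star
        grad = x.T @ (x @ w - y) / len(x)        # squared-loss gradient
        if precondition:
            grad = np.linalg.solve(Sigma, grad)  # precondition by Sigma^{-1}
        w -= lr * grad
    return np.linalg.norm(w - w_star)            # distance to the true predictor

print("SGD error:               ", run(precondition=False))
print("preconditioned SGD error:", run(precondition=True))
```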
