Poster
On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning
Thomas T. Zhang · Behrad Moniri · Ansh Nagwekar · Faraz Rahman · Anton Xue · Hamed Hassani · Nikolai Matni
West Exhibition Hall B2-B3 #W-1018
Recently, a new family of optimization algorithms has shown great promise in making neural network training faster and more efficient in practice. These algorithms introduce new forms of "preconditioning": the practice of "re-sizing" a problem to make good solutions easier to find. The current standard optimizer, Adam, performs preconditioning independently on each parameter of a neural network, while these new algorithms also take into account dependencies between parameters within each layer of the network, hence "layer-wise preconditioning".

On the other hand, theory researchers have proposed various problems to understand very precisely how neural networks find good solutions. These works typically study the most basic optimization algorithm, stochastic gradient descent (SGD). However, we found that SGD is fundamentally limited: when the data is not "well-conditioned" (imagine some coordinates of the data being much larger than others), these positive results about neural network training no longer hold, in theory or in practice.

In adjusting SGD to work for general types of data, we found that the resulting algorithm aligns with these practical layer-wise preconditioning algorithms. This has implications both for theorists, for whom these results provide a concrete path to analyzing larger families of neural network optimization algorithms, and for practitioners, for whom these results provide a strong mathematical motivation for why these new algorithms work.
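As a rough illustration of the idea (not the paper's algorithm), the sketch below contrasts a plain gradient step with a layer-wise preconditioned step on a single linear layer with ill-conditioned inputs. The particular preconditioner used here (the inverse of the layer's input covariance) and all numerical choices are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch (not the paper's exact method): contrast a plain gradient
# step with a layer-wise preconditioned step on a single linear layer trained
# with squared loss. The preconditioner below (inverse input covariance) is a
# standard illustrative choice, assumed here for demonstration purposes.

rng = np.random.default_rng(0)
n, d = 2000, 10

# Ill-conditioned inputs: some coordinates are much larger than others.
scales = np.logspace(0, 3, d)
X = rng.standard_normal((n, d)) * scales
w_true = rng.standard_normal(d)
y = X @ w_true

def grad(w):
    # Gradient of the mean squared error 0.5 * ||X w - y||^2 / n.
    return X.T @ (X @ w - y) / n

lr = 1e-7  # small enough for the plain step to stay stable on this problem
w_plain = np.zeros(d)
w_layer = np.zeros(d)

# Layer-wise preconditioner: inverse of the layer's input covariance,
# which "re-sizes" the problem so all coordinates are on a comparable scale.
cov = X.T @ X / n
precond = np.linalg.inv(cov + 1e-6 * np.eye(d))

for _ in range(500):
    w_plain -= lr * grad(w_plain)             # plain (unpreconditioned) step
    w_layer -= 0.5 * precond @ grad(w_layer)  # layer-wise preconditioned step

print("plain gradient descent error:   ", np.linalg.norm(w_plain - w_true))
print("layer-preconditioned error:     ", np.linalg.norm(w_layer - w_true))
```

In this toy setting the plain update makes almost no progress on the small-scale coordinates, while the layer-wise preconditioned update treats all coordinates on an equal footing and converges quickly; this mirrors the abstract's point about SGD breaking down when the data is not well-conditioned.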