Geometric Convergence of Gauss–Newton for Neural Networks: Riemannian Geometry and Adaptive Damping
Abstract
Ill-conditioned kernel matrices and loss landscapes can make first-order methods for training neural networks converge slowly. We establish non-asymptotic convergence bounds for the Gauss–Newton method in both the under- and overparameterized regimes, showing that it avoids these conditioning bottlenecks. In the underparameterized setting, Gauss–Newton gradient flow in parameter space induces a Riemannian gradient flow on a low-dimensional submanifold of function space. Using tools from Riemannian optimization, we show that, under an appropriate output scaling, the loss satisfies geodesic Polyak–Łojasiewicz and Lipschitz-smoothness conditions, implying geometric convergence to the optimal in-class predictor at an explicit rate independent of Gram-matrix conditioning. In the overparameterized setting, we identify adaptive, curvature-aware regularization schedules and prove fast geometric convergence to a global optimum for both Gauss–Newton gradient flow and discrete-time Gauss–Newton iterates, with rates independent of the minimum eigenvalue of the neural tangent kernel and, locally, independent of the strong-convexity modulus. Overall, Gauss–Newton can be provably faster in ill-conditioned regimes where first-order methods slow down.
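As a brief sketch of the convergence mechanism invoked in the abstract (a generic argument under illustrative notation, not the paper's precise statement: $f_t$ denotes the predictor evolving on the function-space submanifold, $\operatorname{grad}$ and $\|\cdot\|_g$ the Riemannian gradient and metric norm, $\mu$ a geodesic Polyak–Łojasiewicz constant, and $\mathcal{L}^{\star}$ the optimal in-class loss), a geodesic PL inequality turns the Riemannian gradient flow into a geometric decay estimate:
\[
\tfrac{1}{2}\,\bigl\|\operatorname{grad}\mathcal{L}(f_t)\bigr\|_{g}^{2} \;\ge\; \mu\,\bigl(\mathcal{L}(f_t)-\mathcal{L}^{\star}\bigr)
\quad\Longrightarrow\quad
\frac{d}{dt}\bigl(\mathcal{L}(f_t)-\mathcal{L}^{\star}\bigr)
= -\bigl\|\operatorname{grad}\mathcal{L}(f_t)\bigr\|_{g}^{2}
\;\le\; -2\mu\,\bigl(\mathcal{L}(f_t)-\mathcal{L}^{\star}\bigr),
\]
so that $\mathcal{L}(f_t)-\mathcal{L}^{\star} \le e^{-2\mu t}\bigl(\mathcal{L}(f_0)-\mathcal{L}^{\star}\bigr)$ along the flow; the rate is governed by the geodesic PL constant rather than by the conditioning of the Gram matrix.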