Implicit Bias of the Step Size in Linear Diagonal Neural Networks

Mor Shpigel Nacson · Kavya Ravichandran · Nati Srebro · Daniel Soudry

Hall E #211

Keywords: [ Optimization ] [ DL: Theory ]

[ Abstract ]
[ Poster [ Paper PDF
Wed 20 Jul 3:30 p.m. PDT — 5:30 p.m. PDT
Spotlight presentation: Deep Learning
Wed 20 Jul 7:30 a.m. PDT — 9 a.m. PDT

Abstract: Focusing on diagonal linear networks as a model for understanding the implicit bias in underdetermined models, we show how the gradient descent step size can have a large qualitative effect on the implicit bias, and thus on generalization ability. In particular, we show how using large step size for non-centered data can change the implicit bias from a "kernel" type behavior to a "rich" (sparsity-inducing) regime --- even when gradient flow, studied in previous works, would not escape the "kernel" regime. We do so by using dynamic stability, proving that convergence to dynamically stable global minima entails a bound on some weighted $\ell_1$-norm of the linear predictor, i.e. a "rich" regime. We prove this leads to good generalization in a sparse regression setting.

Chat is not available.