Width Independent Bounds for the Local Lipschitz Constant of Deep Neural Networks at Random Initialization and after Lazy Training
Abstract
A plethora of recent works has shown that for wide, overparameterized neural networks, training with Stochastic Gradient Descent (SGD) often leads to interpolation of the training data without sacrificing generalization performance. A key quantity that is closely connected not only to generalization properties but also to other desiderata, such as robustness against adversarial perturbations, is the Lipschitz constant of the neural network. While the Lipschitz constant has been empirically observed not to increase with network width, existing theoretical results only provide bounds that grow logarithmically in the width, and only at random initialization of ReLU networks. In this work, we close this gap for neural networks with smooth activations by showing that, both at random initialization and throughout lazy training, the local Lipschitz constant of deep neural networks does not increase with network width. More precisely, we establish novel non-asymptotic (finite-width) upper bounds and corroborate them with numerical experiments.