Abstract: We reveal a strong implicit bias of stochastic gradient descent (SGD) that drives initially expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters and improving generalization. To reveal this bias, we identify invariant sets, i.e., subsets of parameter space that trap SGD dynamics once entered. We further establish sufficient conditions for stochastic attractivity to these simpler invariant sets based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of SGD noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle points or local maxima of the train loss. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Our analysis also mechanistically explains why early training with large learning rates for extended periods benefits subsequent generalization by promoting stochastic collapse. Finally, we empirically demonstrate the strong effect of stochastic collapse in benchmark architectures and datasets, revealing surprisingly large groups of redundant neurons with identical incoming and outgoing weights after training, due to attractive invariant sets associated with permutation symmetry.
Joint with Feng Chen, Daniel Kunin, and Atsushi Yamamura* (* equal contribution)
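The invariant set associated with permutation symmetry mentioned in the abstract can be seen in a toy example. The sketch below (my own illustration, not code from the paper) builds a one-hidden-layer network in NumPy, initializes two hidden units with identical incoming and outgoing weights, and runs full-batch gradient descent on a squared loss. Because the two units compute identical activations, they receive identical gradients at every step, so they remain identical throughout training: once on this set, the dynamics never leave it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network: pred = w2 @ tanh(W1 @ x)
n_in, n_hidden = 3, 4
W1 = rng.normal(size=(n_hidden, n_in))
w2 = rng.normal(size=n_hidden)

# Place hidden units 0 and 1 on the permutation-symmetry invariant set:
# identical incoming rows and identical outgoing weights.
W1[1] = W1[0].copy()
w2[1] = w2[0]

# Random regression data (purely illustrative).
X = rng.normal(size=(32, n_in))
y = rng.normal(size=32)

lr = 0.05
for _ in range(200):
    h = np.tanh(X @ W1.T)               # hidden activations, shape (32, n_hidden)
    err = h @ w2 - y                    # residual; dL/dpred for 0.5 * MSE
    g_w2 = h.T @ err / len(X)           # gradient w.r.t. outgoing weights
    g_h = np.outer(err, w2) * (1 - h**2)
    g_W1 = g_h.T @ X / len(X)           # gradient w.r.t. incoming weights
    w2 -= lr * g_w2
    W1 -= lr * g_W1

# Units 0 and 1 compute the same activation, so their gradients agree
# exactly at every step; the duplicate pair persists after training.
print(np.allclose(W1[0], W1[1]), np.isclose(w2[0], w2[1]))
```

The same argument applies step-by-step to SGD, since any minibatch produces identical gradients for the two duplicated units; the paper's contribution is showing when such sets are not merely invariant but *attractive* under SGD noise.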