Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization
Kaiyue Wen · Tengyu Ma · Zhiyuan Li

Why overparameterized neural networks generalize remains mysterious. Existing proofs show that common stochastic optimizers tend to converge to flat minimizers of the training loss, so a natural and popular explanation is that flatness implies generalization. This work critically examines that explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1) flatness provably implies generalization; (2) there exist non-generalizing flattest models, and sharpness minimization algorithms fail to generalize; and (3) perhaps most strikingly, there exist non-generalizing flattest models, but sharpness minimization algorithms still generalize. Our results suggest that the relationship between sharpness and generalization depends subtly on the data distribution and the model architecture, and that sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. This calls for a search for other explanations of the generalization of overparameterized neural networks.

Author Information

Kaiyue Wen (Tsinghua University)
Tengyu Ma (Stanford University)
Zhiyuan Li (Stanford University)
