

Poster

Benign Overfitting in Adversarially Trained Neural Networks

Yunjuan Wang · Kaibo Zhang · Raman Arora


Abstract: Benign overfitting is the phenomenon wherein no predictor in the hypothesis class can achieve perfect accuracy (i.e., the non-realizable or noisy setting), yet a model that interpolates the training data still achieves good generalization. A series of recent works has aimed to understand this phenomenon for regression and classification tasks, using linear predictors as well as two-layer neural networks. In this paper, we study benign overfitting in an adversarial setting. We show that, under a distributional assumption, interpolating neural networks found via adversarial training generalize well despite additive inference-time attacks. Specifically, we provide convergence and generalization guarantees for adversarial training of two-layer networks (with both smooth and non-smooth activation functions), showing that under a moderate $\ell_2$-norm perturbation budget, the trained model has near-zero robust training loss and near-optimal robust generalization error. We support our theoretical findings with an empirical study on synthetic and real-world data.
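
The abstract does not include code, but the training procedure it analyzes can be sketched concretely. Below is a minimal, illustrative PyTorch sketch of adversarial training of a two-layer network against $\ell_2$-bounded additive perturbations. The architecture widths, the PGD-style inner maximization, the loss function, and all names here are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): adversarial training of a
# two-layer network under an l2 perturbation budget, per the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    """Two-layer network v^T sigma(Wx); smooth or non-smooth activation."""
    def __init__(self, d, m, smooth=True):
        super().__init__()
        self.W = nn.Linear(d, m, bias=False)
        self.v = nn.Linear(m, 1, bias=False)
        # softplus as a smooth activation, ReLU as the non-smooth one
        self.act = F.softplus if smooth else F.relu

    def forward(self, x):
        return self.v(self.act(self.W(x))).squeeze(-1)

def l2_pgd_attack(model, x, y, eps, steps=10, alpha=None):
    """Inner maximization: PGD restricted to the l2 ball of radius eps
    (an additive inference-time attack)."""
    alpha = alpha or 2.5 * eps / steps
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.binary_cross_entropy_with_logits(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend along the l2-normalized gradient direction.
        g_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1)
        delta = delta + alpha * grad / g_norm
        # Project the perturbation back onto the l2 ball of radius eps.
        d_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1)
        delta = (delta * (eps / d_norm).clamp(max=1.0))
        delta = delta.detach().requires_grad_(True)
    return delta.detach()

def adversarial_train(model, loader, eps, epochs=100, lr=1e-2):
    """Outer minimization: gradient descent on the robust (adversarial) loss."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:  # y: float labels in {0, 1}
            delta = l2_pgd_attack(model, x, y, eps)
            opt.zero_grad()
            F.binary_cross_entropy_with_logits(model(x + delta), y).backward()
            opt.step()
```

In this sketch the robust loss is $\max_{\|\delta\|_2 \le \epsilon} \ell(f(x+\delta), y)$, approximated by projected gradient ascent in the inner loop; the paper's guarantees concern driving this robust training loss to near zero while keeping robust generalization error near optimal.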
