Spotlight
in
Workshop: Over-parameterization: Pitfalls and Opportunities

Beyond Implicit Regularization: Avoiding Overfitting via Regularizer Mirror Descent

Navid Azizan · Sahin Lale · Babak Hassibi

Abstract: It is widely recognized that, despite perfectly interpolating the training data, deep neural networks (DNNs) can still generalize well due in part to the ``implicit regularization'' induced by the learning algorithm. Nonetheless, ``explicit regularization'' (or weight decay) is often used to avoid overfitting, especially when the data is known to be corrupted. There are several challenges with using explicit regularization, most notably unclear convergence properties. In this paper, we propose a novel variant of the stochastic mirror descent (SMD) algorithm, called \emph{regularizer mirror descent (RMD)}, for training DNNs. The starting point for RMD is a cost which is the sum of the training loss and any convex regularizer of the network weights. For highly-overparameterized models, RMD provably converges to a point ``close'' to the optimal solution of this cost. The algorithm imposes virtually no additional computational burden compared to stochastic gradient descent (SGD) or weight decay, and is parallelizable in the same manner that they are. Our experimental results on training sets that contain some errors suggest that, in terms of generalization performance, RMD outperforms both SGD, which implicitly regularizes for the $\ell_2$ norm of the weights, and weight decay, which explicitly does so. This makes RMD a viable option for training with regularization in DNNs. In addition, RMD can be used for regularizing the weights to be close to a desired weight vector, which is particularly important for continual learning.