Workshop

Over-parameterization: Pitfalls and Opportunities

Yasaman Bahri, Quanquan Gu, Amin Karbasi, Hanie Sedghi

Abstract:

Modern machine learning models are often highly over-parameterized. The prime examples are neural network architectures achieving state-of-the-art performance, which have many more parameters than training examples. While these models can empirically perform very well, they are not well understood. Worst-case theories of learnability do not explain their behavior. Indeed, over-parameterized models sometimes exhibit "benign overfitting", i.e., they have the power to perfectly fit training data (even data modified to have random labels), yet they achieve good performance on the test data. There is evidence that over-parameterization may be helpful both computational and statistically, although attempts to use phenomena like double/multiple descent to explain that over-parameterization helps to achieve small test error remain controversial. Besides benign overfitting and double/multiple descent, many other interesting phenomena arise due to over-parameterization, and many more may have yet to be discovered. Many of these effects depend on the properties of data, but we have only simplistic tools to measure, quantify, and understand data. In light of rapid progress and rapidly shifting understanding, we believe that the time is ripe for a workshop focusing on understanding over-parameterization from multiple angles.

Gathertown room1 link: https://eventhosts.gather.town/DbHJbA5ArXpTIoap/icml-oppo-2021
Gathertown room2 link: https://eventhosts.gather.town/UtqQ3jSJ7wnN0anj/icml-oppo-2021-room-2

Chat is not available.

Timezone: »

Schedule

Sat 9:00 a.m. - 9:05 a.m.
Opening Remarks (opening)   
Sat 9:05 a.m. - 9:50 a.m.
  

Because the phenomenon of adversarial examples in deep networks poses a serious barrier to the reliable and robust application of this methodology, there has been considerable interest in why it arises. We consider ReLU networks of constant depth with independent Gaussian parameters, and show that small perturbations of input vectors lead to large changes of outputs. Building on insights of Daniely and Schacham (2020) and of Bubeck et al (2021), we show that adversarial examples arise in these networks because the functions that they compute are very close to linear. The main result is for networks with a constant depth, but we also show that some constraint on depth is necessary for a result of this kind: there are suitably deep networks that, with constant probability, compute a function that is close to constant. This, combined with results characterizing benign overfitting in linear regression, suggests two potential mechanisms behind adversarial examples in overparameterized settings, one arising from label noise and the other from random initialization. Joint work with Sébastien Bubeck and Yeshwanth Cherapanamjeri

Peter Bartlett
Sat 9:50 a.m. - 10:00 a.m.
Live Q&A with Peter Bartlett (Live Q&A)
Sat 10:00 a.m. - 10:55 a.m.
  

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. In this talk, I will discuss some general mathematical principles allowing for efficient optimization in over-parameterized non-linear systems, a setting that includes deep neural networks. I will discuss that optimization problems corresponding to these systems are not convex, even locally, but instead satisfy the Polyak-Lojasiewicz (PL) condition on most of the parameter space, allowing for efficient optimization by gradient descent or SGD. I will connect the PL condition of these systems to the condition number associated with the tangent kernel and show how a non-linear theory for those systems parallels classical analyses of over-parameterized linear equations. As a separate related development, I will discuss a perspective on the remarkable recently discovered phenomenon of transition to linearity (constancy of NTK) in certain classes of large neural networks. I will show how this transition to linearity results from the scaling of the Hessian with the size of the network controlled by certain functional norms. Combining these ideas, I will show how the transition to linearity can be used to demonstrate the PL condition and convergence for a general class of wide neural networks. Finally, I will comment on systems that are ''almost'' over-parameterized, which appears to be common in practice.

Based on joint work with Chaoyue Liu and Libin Zhu

Mikhail Belkin
Sat 10:55 a.m. - 11:10 a.m.
  

We discover a new set of empirical properties of interpolating classifiers, including neural networks, kernel machines and decision trees. Informally, the output distribution of an interpolating classifier matches the distribution of true labels, when conditioned on certain subgroups of the input space. For example, if we mislabel 30% of images of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the test set as well, while leaving other classes unaffected. These behaviors are not captured by classical generalization, which would only consider the average error over the inputs, and not where these errors occur. We introduce and experimentally validate a formal conjecture that specifies the subgroups for which we expect this distributional closeness. Further, we show that these properties can be seen as a new form of generalization, which advances our understanding of the implicit bias of interpolating methods.

Preetum Nakkiran, Yamini Bansal
Sat 11:10 a.m. - 11:25 a.m.
  

This paper examines the impact of static sparsity on the robustness of a trained network to weight perturbations, data corruption, and adversarial examples. We show that, up to a certain sparsity achieved by increasing network width and depth while keeping the network capacity fixed, sparsified networks consistently match and often outperform their initially dense versions. Robustness and accuracy decline simultaneously for very high sparsity due to loose connectivity between network layers. Our findings show that a rapid robustness drop caused by network compression observed in the literature is due to a reduced network capacity rather than sparsity.

Lukas Timpl, Rahim Entezari, Hanie Sedghi, Behnam Neyshabur, Olga Saukh
Sat 11:25 a.m. - 11:40 a.m.
  

Even though the goal of pruning is often to reduce the computational resources consumed during training or inference, it comes as no surprise to theoreticians or practitioners that pruning also improves generalization. In this work, we empirically study pruning's effect on generalization, focusing on two state-of-the-art pruning algorithms: weight rewinding and learning-rate rewinding. However, each pruning algorithm is actually an aggregation of many design choices: a weight scoring heuristic, a pruning schedule, a learning rate schedule, among other factors, each of which might contribute to generalization improvement in different ways. We thus ablate each design choice to determine whether it is responsible for pruning's effect on generalization. We find that each individual contribution is limited compared to the generalization improvement achieved with the pruning algorithm in its entirety. Our results also highlight similarities and differences between the effects on generalization caused by pruning and model-width scaling.

Tian Jin, Gintare Karolina Dziugaite, Michael Carbin
Sat 12:30 p.m. - 1:25 p.m.
  

This talk gives an overview of recent results in a line of theoretical work that started 3 decades ago in statistical physics. We will first discuss teacher-student setting of the generalized linear regression. We illustrate the presence of the interpolation peak for classification with ridge loss and its vanishing with regularization. We show that, in the spherical perceptron, the optimally regularized logistic regression approaches very closely the Bayes optimal accuracy. We contrast this with the non-convex case of phase retrieval where the canonical empirical risk minimization performs poorly compared to the Bayes-optimal error. We then move towards learning with hidden units and analyze double descent in learning with generic fixed features and any convex loss. The formulas we obtain a generic enough to describe the learning of the last layer of neural networks for realistic data and networks. Finally, for the phase retrieval, we are able to analyze gradient descent in the feature-learning regime of a two-layer neural network where we show that overparametrization allows a considerable reduction of the sample complexity. Concretely, an overparametrized neural network only needs twice the input dimension of samples, while non-overparametrized network needs constant times more, and kernel regression quadratically many samples in the input dimension.

Lenka Zdeborova
Sat 1:25 p.m. - 2:20 p.m.
  

I consider two layer neural networks trained with square loss in the linear (lazy) regime. Under overparametrization, gradient descent converges to the minimum norm interpolant, and I consider this as well as the hole ridge regularization path. From a statistical viewpoint, these approaches are random features models, albeit of a special type. They are also equivalent to kernel ridge regression, with a random kernel of rank N*d (where N is the number of hidden neurons, and d the input dimension). I will describe a precise characterization of the generalization error when N, d and the sample size are polynomially related (and for covariates that are uniform on the d-dimensional sphere). I will then discuss the limitation of these approaches. I will explain how sparse random feature models can be learnt efficiently to try to address these limitations. [Based on joint work with Michael Celentano, Song Mei, Theodor Misiakiewicz, Yiqiao Zhong]

Andrea Montanari
Sat 2:20 p.m. - 2:35 p.m.
  

Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that using momentum can lead to faster convergence rate in various settings, it has also been empirically observed that adding momentum yields higher generalization. This paper formally studies how momentum help generalization in deep learning:

Samy Jelassi, Yuanzhi Li
Sat 2:35 p.m. - 2:50 p.m.
  

As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g.\ given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g.\ the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can \emph{learn} features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the \emph{Tensor Programs} technique, we derive explicit formulas for such limits. On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finite-width networks, with the latter approaching the infinite-width feature learning performance as width increases.

Greg Yang, Edward Hu
Sat 2:50 p.m. - 3:05 p.m.
  
Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry (or a mixture thereof). In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj.
Sebastien Bubeck, Mark Sellke
Sat 3:55 p.m. - 4:50 p.m.
  

A frequent criticism from the statistics community to modern machine learning is the lack of rigorous uncertainty quantification. Instead, the machine learning community would argue that conventional uncertainty quantification based on idealized distributional assumptions may be too restrictive for real data. This paper will make progress in resolving the above inference dilemma. We propose a computationally efficient method to construct nonparametric, heteroskedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, with strong non-asymptotic coverage properties, and easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.

Tengyuan Liang
Sat 4:50 p.m. - 5:45 p.m.
  

The magnitude of the weights of a neural network is a fundamental measure of complexity that plays a crucial role in the study of implicit and explicit regularization. For example, in recent work, gradient descent updates in overparameterized models asymptotically lead to solutions that implicitly minimize the ell2 norm of the parameters of the model, resulting in an inductive bias that is highly architecture-dependent. To investigate the properties of learned functions, it is natural to consider a function space view given by the minimum ell2 norm of weights required to realize a given function with a given network. We call this the “induced regularizer” of the network. Building on a line of recent work, we study the induced regularizer of linear convolutional neural networks with a focus on the role of kernel size and the number of channels. We introduce an SDP relaxation of the induced regularizer, that we show is tight for networks with a single input channel. Using this SDP formulation, we show that the induced regularizer is independent of the number of the output channels for single-input channel networks, and for multi-input channel networks, we show independence given sufficiently many output channels. Moreover, we show that as the kernel size increases, the induced regularizer interpolates between a basis-invariant norm and a basis-dependent norm that promotes sparse structures in Fourier space. Based on joint work with Meena Jagadeesan and Ilya Razenshteyn.

Suriya Gunasekar
Sat 5:45 p.m. - 6:00 p.m.
  

While deep reinforcement learning (RL) methods present an appealing approach to sequential decision-making, such methods are often unstable in practice. What accounts for this instability? Recent theoretical analysis of overparameterized supervised learning with stochastic gradient descent shows that learning is driven by an implicit regularizer once it reaches the zero training error regime, which results in “simpler” functions that generalize. However, in this paper, we show that in the case of deep RL, implicit regularization can instead lead to degenerate features by characterizing the induced implicit regularizer in the semi-gradient TD learning setting in RL with a fixed-dataset (i.e., offline RL). The regularize recapitulates recent empirical findings regarding the rank collapse of learned features and provides an understanding for its cause. To address the adverse impacts of this implicit regularization, we propose a simple and effective explicit regularizer for TD learning, DR3. DR3 minimizes the similarity of learned features of the Q-network at consecutive state-action tuples in the TD update. Empirically, when combined with existing offline RL methods, DR3 substantially improves both performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images.

Aviral Kumar, Rishabh Agarwal, Aaron Courville, Tengyu Ma, George Tucker, Sergey Levine
Sat 6:00 p.m. - 6:15 p.m.
  
It is widely recognized that, despite perfectly interpolating the training data, deep neural networks (DNNs) can still generalize well due in part to the ``implicit regularization'' induced by the learning algorithm. Nonetheless, ``explicit regularization'' (or weight decay) is often used to avoid overfitting, especially when the data is known to be corrupted. There are several challenges with using explicit regularization, most notably unclear convergence properties. In this paper, we propose a novel variant of the stochastic mirror descent (SMD) algorithm, called \emph{regularizer mirror descent (RMD)}, for training DNNs. The starting point for RMD is a cost which is the sum of the training loss and any convex regularizer of the network weights. For highly-overparameterized models, RMD provably converges to a point ``close'' to the optimal solution of this cost. The algorithm imposes virtually no additional computational burden compared to stochastic gradient descent (SGD) or weight decay, and is parallelizable in the same manner that they are. Our experimental results on training sets that contain some errors suggest that, in terms of generalization performance, RMD outperforms both SGD, which implicitly regularizes for the $\ell_2$ norm of the weights, and weight decay, which explicitly does so. This makes RMD a viable option for training with regularization in DNNs. In addition, RMD can be used for regularizing the weights to be close to a desired weight vector, which is particularly important for continual learning.
Navid Azizan, Sahin Lale, Babak Hassibi
Sat 6:15 p.m. - 6:20 p.m.
Closing Remarks (closing)   
-
[ Visit Poster at Spot A2 in Virtual World ]

We investigate the performance of distributed learning for large-scale linear regression where the model unknowns are distributed over the network. We provide high-probability bounds on the generalization error for this distributed learning setting for isotropic as well as correlated Gaussian regressors. Our investigations show that the generalization error of the distributed solution can grow unboundedly even though the training error is low. We highlight the effect of partitioning of the training data over the network of learners on the generalization error. Our results are particularly interesting for the overparametrized scenario, illustrating fast convergence but also possibly unbounded generalization error.

Martin Hellkvist, Ayca Ozcelikkale
-
[ Visit Poster at Spot A0 in Virtual World ]

In this work, we present an analytical solution for a simple model of classification problems on structured data. Using methods from statistical physics, we obtain a precise asymptotic expression for the test errors of random feature models trained on a strong and weak features - a model of data with input data covariance built from independent blocks allowing us to tune the saliency of low-dimensional structures and their alignment with respect to the target function. Leveraging our analytical result, we explore how properties of data distributions impact generalization in the over-parametrized regime and compare results for the logistic and square loss. Our results show in particular that the logistic loss benefits more robustly from structured data than the squared loss. Numerical experiments on MNIST and CIFAR10 confirm this insight.

Stéphane d'Ascoli, Marylou Gabrié, Levent Sagun, Giulio Biroli
-
[ Visit Poster at Spot A3 in Virtual World ]

We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient over-parameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk.

Niladri Chatterji, Phil Long
-
[ Visit Poster at Spot A4 in Virtual World ]

Neural networks can overfit the noisy training data and still generalize well on the unseen data. Current methods focus on training dynamics or varying the noise in the dataset to understand this phenomenon. We propose a method that allows us to compare the similarity between two sets of samples for a given network. Our approach relies on the weights learned by the network and is independent of the training dynamics & properties of the training dataset. Using the proposed method, we investigate three hypotheses empirically: Are real and noisy samples learned at different parts of the network? Do real and noisy samples contribute equally to the generalization? Are real samples more similar to each other than the noisy samples?

Sudhanshu Ranjan
-
[ Visit Poster at Spot A5 in Virtual World ]

We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.

Niladri Chatterji, Phil Long, Peter Bartlett
-
[ Visit Poster at Spot A6 in Virtual World ]

We study the properties of alignment, a form of implicit regularization, in linear neural networks under gradient descent. We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1d outputs as defined in (Ji & Telgarsky, 2018). While in fully connected networks there always exists a global minimum corresponding to an aligned solution, we analyze alignment as it relates to the training process. Namely, we characterize when alignment is an invariant of training under gradient descent by providing necessary and sufficient conditions for this invariant to hold. In such settings, the dynamics of gradient descent simplify, thereby allowing us to provide an explicit learning rate under which the network converges linearly to a global minimum. We conclude by analyzing networks with layer constraints, such as convolutional networks, and prove that alignment is impossible with sufficiently large datasets.

Adit Radhakrishnan, Eshaan Nichani, Daniel Bernstein, Caroline Uhler
-
[ Visit Poster at Spot B0 in Virtual World ]

Recent works have demonstrated that increasing model capacity through width in over-parameterized neural networks leads to a decrease in test risk. Model capacity, however, can also be increased through depth, yet understanding the impact of increasing depth on test risk remains an open question. In this work, we demonstrate that the test risk of over-parameterized convolutional networks is a U-shaped curve (i.e. monotonically decreasing, then increasing) with increasing depth. We first provide empirical evidence for this phenomenon via image classification experiments using both ResNets and the convolutional neural tangent kernel (CNTK). We then present a novel linear regression framework for characterizing the impact of depth on test risk, and show that increasing depth leads to a U-shaped test risk for the linear CNTK. In particular, we prove that the linear CNTK corresponds to a depth-dependent linear transformation on the original space and characterize properties of this transformation. We then analyze over-parameterized linear regression under arbitrary linear transformations and, in simplified settings, identify the depths which minimize the bias and variance terms of the test risk.

Eshaan Nichani, Adit Radhakrishnan, Caroline Uhler
-
[ Visit Poster at Spot B1 in Virtual World ]

Over-parametrization has become a popular technique in deep learning. It is observed that by over-parametrization, a larger neural network needs a fewer training iterations than a smaller one to achieve a certain level of performance --- namely, over-parametrization leads to acceleration in optimization. However, despite that over-parametrization is widely used nowadays, little theory is available to explain the acceleration due to over-parametrization. In this paper, we propose understanding it by studying a simple problem first. Specifically, we consider the setting that there is a single teacher neuron with quadratic activation, where over-parametrization is realized by having multiple student neurons learn the data generated from the teacher neuron. We provably show that over-parametrization helps the iterate generated by gradient descent to enter the neighborhood of a global optimal solution that achieves zero testing error faster.

Jun-Kun Wang, Jake Abernethy
-
[ Visit Poster at Spot B2 in Virtual World ]

A recent study on overparameterized neural networks by Huh et al. (2021) finds that gradient-based training of overparameterized neural networks finds low-rank parameters, implying that certain implicit low-rank constraints are in action. Inspired by this, we empirically study the effective VC dimension of neural networks with low-rank parameters. Our finding is that their effective VC dimension is proportional to a specific weighted sum of per-layer parameter counts, which we call the effective number of parameters. As the effective VC dimension lower bounds the VC dimension, our result suggests a possibility that the analytic VC dimension upper bound proposed by Bartlett et al. (2019) is indeed tight for neural networks with low-rank parameters.

Daewon Seo, Hongyi Wang, Dimitris Papailiopoulos, Kangwook Lee
-
[ Visit Poster at Spot B3 in Virtual World ]
``Benign overfitting'', where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works have provided theoretical justification in over-parameterized linear regression and classification, and kernel methods. However, it is not clear if benign overfitting still occurs in the presence of adversarial examples, which are examples with tiny and intentional perturbations to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defending against adversarial examples. In detail, we prove the risk bounds of the adversarially trained linear classifier on the mixture of sub-Gaussian data under $\ell_p$ adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve the near-optimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings.
Jinghui Chen, Yuan Cao, Yuan Cao, Quanquan Gu
-
[ Visit Poster at Spot B4 in Virtual World ]

The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical bias-variance curve with the behavior of modern neural networks, occurring where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomena and the number of samples in the deep neural network setting. In particular, we propose a construction which augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent, and observe a smooth descent into the overparameterized region for our construction. This occurs both with respect to the model size, and with respect to the number epochs.

John Chen, Qihan Wang, Tasos Kyrillidis
-
[ Visit Poster at Spot B5 in Virtual World ]

A key challenge facing deep learning is that neural networks are often not robust to shifts in the underlying data distribution. We study this problem from the perspective of the statistical concept of parameter identification. Generalization bounds from learning theory often assume that the test distribution is close to the training distribution. In contrast, if we can identify the ``true'' parameters, then the model generalizes to arbitrary distribution shifts. However, neural networks are typically overparameterized, making parameter identification impossible. We show that for quadratic neural networks, we can identify the function represented by the model even though we cannot identify its parameters. Thus, we can obtain robust generalization bounds even in the overparameterized setting. We leverage this result to obtain new bounds for contextual bandits and transfer learning with quadratic neural networks. Overall, our results suggest that we can improve robustness of neural networks by designing models that can represent the true data generating process. In practice, the true data generating process is often very complex; thus, we study how our framework might connect to neural module networks, which are designed to break down complex tasks into compositions of simpler ones. We prove robust generalization bounds when individual neural modules are identifiable.

Kan Xu, Hamsa Bastani, Osbert Bastani
-
[ Visit Poster at Spot B6 in Virtual World ]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) +\lambda R(\theta)$, where $L(\theta)$ is the training loss, $\lambda$ is an effective regularization parameter, and $R(\theta)$ is an explicit regularizer that penalizes sharp minimizers. Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones. We also prove extensions to classification with general loss functions, SGD with momentum, and SGD with general noise covariance, significantly strengthening the prior work of Blanc et al. to global convergence and large learning rates and of HaoChen et al. to general models.
Alex Damian, Tengyu Ma, Jason Lee
-
[ Visit Poster at Spot C0 in Virtual World ]

Increasing capacity of neural network architectures by varying their width and depth has been central to their successes. However, recent work has shown that in overparameterized models, the hidden representations exhibit a block structure --- a large set of contiguous layers with highly similar representations. In this paper, we investigate how this block structure arises, its connection to the data, and the relationship between training mechanisms and the block structure. We begin by showing that the block structure representations are robust to small out-of-distribution shifts in the data. Leveraging insights connecting the block structure and the first principal component of the representations, we then demonstrate that the block structure arises from a small group of examples with similar image statistics. These examples have very large activation norms, and dominate the representational geometry of intermediate network layers. While these "dominant" datapoints are similar across all layers inside the block structure of a single network, different training runs lead to different sets of dominant datapoints. With these insights, we take an interventional approach, introducing a method to regularize the block structure, and also exploring how popular training mechanisms that help with performance can eliminate the block structure in the internal representations of overparameterized models.

Thao Nguyen, Maithra Raghu, Simon Kornblith
-
[ Visit Poster at Spot C1 in Virtual World ]

The deployment of convolutional neural networks is often hindered by high computational and storage requirements. Structured model pruning is a promising approach to alleviate these requirements. Using the VGG-16 model as an example, we measure the accuracy-efficiency trade-off for various structured model pruning methods and datasets (CIFAR-10 and ImageNet) on Tensor Processing Units (TPUs). To measure the actual performance of models, we develop a structured model pruning library for TensorFlow2 to modify models in place (instead of adding mask layers). We show that structured model pruning can significantly improve model memory usage and speed on TPUs without losing accuracy, especially for small datasets (e.g., CIFAR-10).

Kongtao Chen
-
[ Visit Poster at Spot C2 in Virtual World ]

Motivated by the growing literature on ``benign overfitting" in overparameterized models, we study benign overfitting in multiclass linear classification. Specifically, we consider the following popular training algorithms on separable data generated from Gaussian mixtures: (i) empirical risk minimization (ERM) with cross-entropy loss, which converges to the multiclass support vector machine (SVM) solution; (ii) ERM with least-squares loss, which converges to the min-norm interpolating (MNI) solution; and, (iii) the one-vs-all SVM classifier. Our first key finding is that under a simple sufficient condition, all three algorithms lead to classifiers that interpolate the training data and have equal accuracy. Second, we derive novel error bounds on the accuracy of the MNI classifier, thereby showing that all three training algorithms lead to benign overfitting under sufficient overparameterization. Ultimately, our analysis shows that good generalization is possible for SVM solutions beyond the realm in which typical margin-based bounds apply.

Ke Wang, Vidya Muthukumar, Christos Thrampoulidis
-
[ Visit Poster at Spot C3 in Virtual World ]
We study the function space characterization of the inductive bias resulting from controlling the $\ell_2$ norm of the weights in linear convolutional networks. We view this in terms of an *induced regularizer* in the function space given by the minimum norm of weights required to realize a linear function. For two layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network have a single channel, the induced regularizer for any $K$ is a norm given by a semidefinite program (SDP) that is *independent* of the number of output channels $C$. (b) In contrast, for networks with multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions and thus the inductive bias \emph{does} depend on $C$. Further, for sufficiently large $C$, the induced regularizer for $K=1$ and $K=D$ are the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients. (c) Complementing our theoretical results, we show through experiments on MNIST and CIFAR-10 that our key findings extend to implicit biases from gradient descent in overparameterized networks.
Meena Jagadeesan, Ilya Razenshteyn, Suriya Gunasekar
-
[ Visit Poster at Spot C4 in Virtual World ]

We study the dynamics of temporal difference learning with neural network-based value function approximation over a general state space, namely, \emph{Neural TD learning}. We consider two practically used algorithms, \emph{projection-free} and \emph{max-norm regularized} Neural TD learning, and establish the first convergence bounds for these algorithms. An interesting observation from our results is that max-norm regularization can dramatically improve the performance of TD learning algorithms, both in terms of sample complexity and overparameterization. In particular, we prove that max-norm regularization achieves state-of-the-art sample complexity and overparameterization bounds by exploiting the geometry of the neural tangent kernel (NTK) function class. The results in this work rely on a novel Lyapunov drift analysis of the network parameters as a stopped and controlled random process.

Semih Cayci, Siddhartha Satpathi, Niao He, R Srikant
-
[ Visit Poster at Spot C5 in Virtual World ]
We present a novel analysis of feature selection in linear models by the convex framework of least absolute shrinkage operator (LASSO) and basis pursuit (BP). Our analysis pertains to a general overparametrized scenario. When the numbers of the features and the data samples grow proportionally, we obtain precise expressions for the asymptotic generalization error of LASSO and BP. Considering a mixture of strong and weak features, we provide insights into regularization trade-offs for double descent for $\ell_1$ norm minimization. We validate these results with numerical experiments.
David Bosch, Ashkan Panahi, Ayca Ozcelikkale
-
[ Visit Poster at Spot C6 in Virtual World ]

Training deep neural networks in low rank, i.e. with factorised layers, is of particular interest to the community: it offers efficiency over unfactorised training in terms of both memory consumption and training time. Prior work has focused on low rank approximations of pre-trained networks and training in low rank space with additional objectives, offering various ad hoc explanations for chosen practice. We analyse techniques that work well in practice, and through extensive ablations on models such as GPT2 we provide evidence falsifying common beliefs in the field, hinting in the process at exciting research opportunities that still need answering.

Sid Kamalakara, Acyr Locatelli, Bharat Venkitesh, Jimmy Ba, Yarin Gal, Aidan Gomez
-
[ Visit Poster at Spot D0 in Virtual World ]

Sparsity and low-rank structures have been incorporated into neural networks to reduce computational complexity and to improve generalization and robustness. Recent theoretical developments show that both are natural characteristics of data-fitting solutions cast in a new family of Banach spaces referred to as RBV2 spaces, the spaces of second-order bounded variation in the Radon domain. Moreover, sparse and deep ReLU networks are solutions to infinite dimensional variational problems in compositions of these spaces. This means that these learning problems can be recast as parametric optimizations over neural network weights. Remarkably, standard weight decay and variants correspond exactly to regularizing the RBV2-norm in the function space. Empirical validation in this paper confirm that weight decay leads to sparse and low-rank networks, as predicted by the theory.

Rahul Parhi, Jack Wolf, Robert Nowak
-
[ Visit Poster at Spot D1 in Virtual World ]

We analyze the learning dynamics of infinitely wide neural networks with a finite sized bottleneck. Unlike the neural tangent kernel limit, a bottleneck in an otherwise infinite width network allows data dependent feature learning in its bottleneck representation. We empirically show that a single bottleneck in infinite networks dramatically accelerates training when compared to purely infinite networks, with an improved overall performance. We discuss the acceleration phenomena by drawing similarities to infinitely wide deep linear models, where the acceleration effect of a bottleneck can be understood theoretically.

Etai Littwin, Omid Saremi, Shuangfei Zhai, Vimal Thilak, Hanlin Goh, Josh M Susskind, Greg Yang
-
[ Visit Poster at Spot D2 in Virtual World ]

In practice, classification models that generalize well are often susceptible to adversarial perturbations. We illustrate a novel estimation-centric explanation of adversarial susceptibility using an overparameterized linear model on lifted Fourier features. We show that the min 2-norm interpolator of training data can be susceptible even to adversaries who can only perturb the low-dimensional inputs and not the high-dimensional lifted features directly. The adversarial vulnerability arises because of a phenomena we term spatial localization: the predictions of the learned model are markedly more sensitive in the vicinity of training points than elsewhere. This sensitivity is crucially a consequence of feature lifting and can have consequences reminiscent of Gibb’s and Runge’s phenomena from signal processing and functional analysis. Despite the adversarial susceptibility, we find that classification using spatially localized features can be “easier” i.e. less sensitive to the strength of the prior than in independent feature setups. Our findings are replicated theoretically for a random-feature setup that exhibits double-descent behavior, and empirically for polynomial features.

Adhyyan Narang, Vidya Muthukumar, Anant Sahai
-
[ Visit Poster at Spot A1 in Virtual World ]

We identify and formalize a fundamental gradient descent phenomenon leading to a learning proclivity in over-parameterized neural networks. Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalances in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel but simple regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation. We illustrate our findings with simple and real-world out-of-distribution (OOD) generalization experiments.

Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron Courville, Doina Precup, Guillaume Lajoie
-
[ Visit Poster at Spot A2 in Virtual World ]
Magnitude pruning is a common, effective technique to identify sparse subnetworks at little cost to accuracy. In this work, we ask whether a particular architecture's accuracy-sparsity tradeoff can be improved by combining pruning information across multiple runs of training. From a shared ResNet-20 initialization, we train several network copies (\textit{siblings}) to completion using different SGD data orders on CIFAR-10. While the siblings' pruning masks are naively not much more similar than chance, starting sibling training after a few epochs of shared pretraining significantly increases pruning overlap. We then choose a subnetwork by either (1) taking all weights that survive pruning in any sibling (mask union), or (2) taking only the weights that survive pruning across all siblings (mask intersection). The resulting subnetwork is retrained. Strikingly, we find that union and intersection masks perform very similarly. Both methods match the accuracy-sparsity tradeoffs of the one-shot magnitude pruning baseline, even when we combine masks from up to $k = 10$ siblings.
Rajiv Movva, Michael Carbin, Jonathan Frankle
-
[ Visit Poster at Spot A3 in Virtual World ]

A key challenge in building theoretical foundations for deep learning is their complex optimization dynamics. Such dynamics result from high-dimensional interactions between parameters, leading to non-trivial behaviors. In this regard, a particularly puzzling phenomenon is the ``double descent'' of the generalization error with increasing model complexity (model-wise) or training time (epoch-wise). While model-wise double descent has been a subject of extensive study of recent, the origins of the latter are much less clear. To bridge this gap, in this work, we leverage tools from statistical physics to study a simple teacher-student setup exhibiting epoch-wise double descent similar to deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error as a function of the training time. Crucially, this provides a new mechanistic explanation of epoch-wise double descent, suggesting that it can be attributed to features being learned at different time scales. Summarily, while a fast-learning feature is over-fitted, a slower-learning feature starts to fit, resulting in a non-monotonous generalization curve.

Mohammad Pezeshki, Amartya Mitra, Yoshua Bengio, Guillaume Lajoie
-
[ Visit Poster at Spot A4 in Virtual World ]

Deep linear networks trained with gradient descent yield low rank solutions, as is typically studied in matrix factorization. In this paper, we take a step further and analyze implicit rank regularization in autoencoders. We show greedy learning of low-rank latent codes induced by a linear sub-network at the autoencoder bottleneck. We further propose orthogonal initialization and principled learning rate adjustment to mitigate sensitivity of training dynamics to spectral prior and linear depth. With linear autoencoders on synthetic data, our method converges stably to ground-truth latent code rank. With nonlinear autoencoders, our method converges to latent ranks optimal for downstream classification and image sampling.

Shih-Yu Sun, Vimal Thilak, Etai Littwin, Omid Saremi, Josh M Susskind
-
[ Visit Poster at Spot A5 in Virtual World ]

We empirically show that the test error of deep networks can be estimated simply by training the same architecture on the same training set but with a different run of SGD, and measuring the disagreement rate between the two networks on unlabeled test data. This builds on -- and is a stronger version of -- the observation in Nakirran & Bansal (20), which requires the second run to be on an altogether fresh training set. We further theoretically show that this peculiar phenomenon arises from the well-calibrated nature of ensembles of SGD-trained models. This finding not only provides a simple empirical measure to directly predict the test error using unlabeled test data, but also establishes a new conceptual connection between generalization and calibration.

YiDing Jiang, Vaishnavh Nagarajan, Zico Kolter
-
[ Visit Poster at Spot A6 in Virtual World ]

Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this ``benign overfitting'' phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.

Yuan Cao, Yuan Cao, Quanquan Gu, Mikhail Belkin
-
[ Visit Poster at Spot B0 in Virtual World ]

Deep neural networks are a type of over-parameterized models which are able to achieve high performance despite having typically more parameters than training samples. Recently, there has been an increasing interest in uncovering and understanding the different phenomena that occur in the over-parameterized regime induced by such neural networks. In this paper, we aim to shed light on the relationship between class compactness of the learned feature representations and the model performance. Surprisingly, we find that models that learn more class-invariant features do not necessarily perform better. Moreover, we show that during training, class-wise variance increases and the models learn a less compact and more outspread representation of the classes.

Kateryna Chumachenko, Firas Laakom, Jenni Raitoharju, Alexandros Iosifidis, Moncef Gabbouj
-
[ Visit Poster at Spot B1 in Virtual World ]
We apply the theory introduced in \cite{jacot2020kernel} to study the risk (i.e. generalization error) in a polynomial regression setup where the data points $ x_i $'s are i.i.d. samples from the uniform distribution on $[-1,1]$ and the observations $ y_i = f^*(x_i) + \epsilon e_i $ are generated by a continuous true function $ f^* $ where $ e_i $'s are standard Gaussian noise and $ \epsilon $ is the noise level. In our setup, we can precisely compute the Signal Capture Threshold (SCT) as a function of the number of polynomial features $ P+1 $, the number of samples $ N $, and the ridge $ \lambda > 0 $ which enables a precise analysis of the risk and an explanation of the overfitting of polynomial features with overparameterization.
Hugo Fabregues, Berfin Simsek
-
[ Visit Poster at Spot B2 in Virtual World ]

Memorization studies of deep neural networks (DNNs) help to understand what patterns and how do DNNs learn, and motivate improvements to DNN training approaches. In this work, we investigate the memorization properties of SimCLR, a widely used contrastive self-supervised learning approach, and compare them to the memorization of supervised learning and random labels training. We find that both training objects and augmentations may have different complexity in the sense of how SimCLR learns them. Moreover, we show that SimCLR is similar to random labels training in terms of the distribution of training objects complexity.

Ildus Sadrtdinov, Nadezhda Chirkova, Ekaterina Lobacheva
-
[ Visit Poster at Spot B3 in Virtual World ]

Convolutional Neural Networks (CNNs) have been dominating classification tasks in various domains, such as machine vision, machine listening, and natural language processing. In machine listening, while generally exhibiting very good generalization capabilities, CNNs are sensitive to the specific audio recording device used, which has been recognized as a substantial problem in the acoustic scene classification (DCASE) community. In this study, we investigate the relationship between over-parameterization of acoustic scene classification models, and their resulting generalization abilities. Specifically, we test scaling CNNs in width and depth, under different conditions. Our results indicate that increasing width improves generalization to unseen devices, even without an increase in the number of parameters.

Khaled Koutini, Khaled Koutini, Hamid Eghbalzadeh, Florian Henkel, Jan Schlüter, Gerhard Widmer
-
[ Visit Poster at Spot B4 in Virtual World ]

Numerous recent works show that overparameterization implicitly reduces variance for minimum-norm interpolators, suggesting vanishing benefits for ridge regularization in high dimensions. However, empirical findings suggest that this narrative may not hold true for robust generalization. In this paper we reveal that for overparameterized linear regression, the robust risk is minimized for a positive regularization coefficient even when the training data is noiseless. Hence, we effectively provide, to the best of our knowledge, the first theoretical analysis on the phenomenon of robust overfitting.

Konstantin Donhauser, Alexandru Tifrea, Michael Aerni, Reinhard Heckel, Fanny Yang
-
[ Visit Poster at Spot B5 in Virtual World ]

Deep neural networks generalize well despite being exceedingly overparameterized and being trained without explicit regularization. This curious phenomenon, often termed benign overfitting, has inspired extensive research activity in establishing its statistical principles. In this work, we study both max-margin SVM and min-norm interpolating classifiers. First, we leverage an idea introduced in [V. Muthukumar et al., arXiv:2005.08054, (2020)] to relate the SVM solution to the least-squares (LS) interpolating solution. Second, we derive non-asymptotic bounds on the classification error of the LS solution. Combining the two, we present sufficient conditions on the overparameterization ratio and on the signal-to-noise ratio (SNR) for benign overfitting to occur. Moreover, we investigate the role of regularization and identify precise conditions under which the interpolating estimator performs better than the regularized estimates. We corroborate our theoretical findings with numerical simulations.

Ke Wang, Christos Thrampoulidis
-
[ Visit Poster at Spot B6 in Virtual World ]

The goal in label-imbalanced and group-sensitive classification is to optimize metrics such as balanced error and equal opportunity. Classical methods like re-weighted cross-entropy, are known to fail when used with the modern practice of training deep nets to the terminal phase of training (TPT), that is training beyond zero training error. In contrast to previous heuristics, we follow a principled analysis explaining how different loss adjustments affect margins. First, we prove that for linear classifiers trained in TPT, it is necessary to introduce multiplicative, rather than additive, logit adjustments so that the relative margins between classes change appropriately. To show this, we discover a connection of the multiplicative CE modification to the cost-sensitive support-vector machines. While additive adjustments are ineffective deep in the TPT, we show numerically that they can speed up convergence by countering an initial negative effect of the multiplicative weights. Motivated by these findings, we formulate the vector-scaling (VS) loss, that captures existing techniques as special cases. For Gaussian-mixtures data, we perform a generalization analysis, revealing tradeoffs between balanced / standard error and equal opportunity.

Ganesh Ramachandra Kini, Orestis Paraskevas, Samet Oymak, Christos Thrampoulidis
-
[ Visit Poster at Spot C0 in Virtual World ]

This work studies the behavior of neural networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal in terms of not just logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true underlying conditional distribution arbitrarily finely. Moreover, the necessary iteration, sample, and architectural complexities of this analysis all scale naturally with a certain complexity measure of the true conditional model. Lastly, while it is not shown that early stopping is necessary, it is shown that any univariate classifier satisfying a local interpolation property is necessarily inconsistent.

Ziwei Ji, Matus Telgarsky
-
[ Visit Poster at Spot C1 in Virtual World ]

We discover a new set of empirical properties of interpolating classifiers, including neural networks, kernel machines and decision trees. Informally, the output distribution of an interpolating classifier matches the distribution of true labels, when conditioned on certain subgroups of the input space. For example, if we mislabel 30% of images of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the test set as well, while leaving other classes unaffected. These behaviors are not captured by classical generalization, which would only consider the average error over the inputs, and not where these errors occur. We introduce and experimentally validate a formal conjecture that specifies the subgroups for which we expect this distributional closeness. Further, we show that these properties can be seen as a new form of generalization, which advances our understanding of the implicit bias of interpolating methods.

Preetum Nakkiran, Yamini Bansal
-
[ Visit Poster at Spot C2 in Virtual World ]

As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g.\ given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g.\ the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can \emph{learn} features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the \emph{Tensor Programs} technique, we derive explicit formulas for such limits. On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finite-width networks, with the latter approaching the infinite-width feature learning performance as width increases.

Greg Yang, Edward Hu
-
[ Visit Poster at Spot C3 in Virtual World ]

Even though the goal of pruning is often to reduce the computational resources consumed during training or inference, it comes as no surprise to theoreticians or practitioners that pruning also improves generalization. In this work, we empirically study pruning's effect on generalization, focusing on two state-of-the-art pruning algorithms: weight rewinding and learning-rate rewinding. However, each pruning algorithm is actually an aggregation of many design choices: a weight scoring heuristic, a pruning schedule, a learning rate schedule, among other factors, each of which might contribute to generalization improvement in different ways. We thus ablate each design choice to determine whether it is responsible for pruning's effect on generalization. We find that each individual contribution is limited compared to the generalization improvement achieved with the pruning algorithm in its entirety. Our results also highlight similarities and differences between the effects on generalization caused by pruning and model-width scaling.

Tian Jin, Gintare Karolina Dziugaite, Michael Carbin
-
[ Visit Poster at Spot C4 in Virtual World ]
Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry (or a mixture thereof). In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj.
Sebastien Bubeck, Mark Sellke
-
[ Visit Poster at Spot C5 in Virtual World ]

This paper examines the impact of static sparsity on the robustness of a trained network to weight perturbations, data corruption, and adversarial examples. We show that, up to a certain sparsity achieved by increasing network width and depth while keeping the network capacity fixed, sparsified networks consistently match and often outperform their initially dense versions. Robustness and accuracy decline simultaneously for very high sparsity due to loose connectivity between network layers. Our findings show that a rapid robustness drop caused by network compression observed in the literature is due to a reduced network capacity rather than sparsity.

Lukas Timpl, Rahim Entezari, Hanie Sedghi, Behnam Neyshabur, Olga Saukh
-
[ Visit Poster at Spot C6 in Virtual World ]
It is widely recognized that, despite perfectly interpolating the training data, deep neural networks (DNNs) can still generalize well due in part to the ``implicit regularization'' induced by the learning algorithm. Nonetheless, ``explicit regularization'' (or weight decay) is often used to avoid overfitting, especially when the data is known to be corrupted. There are several challenges with using explicit regularization, most notably unclear convergence properties. In this paper, we propose a novel variant of the stochastic mirror descent (SMD) algorithm, called \emph{regularizer mirror descent (RMD)}, for training DNNs. The starting point for RMD is a cost which is the sum of the training loss and any convex regularizer of the network weights. For highly-overparameterized models, RMD provably converges to a point ``close'' to the optimal solution of this cost. The algorithm imposes virtually no additional computational burden compared to stochastic gradient descent (SGD) or weight decay, and is parallelizable in the same manner that they are. Our experimental results on training sets that contain some errors suggest that, in terms of generalization performance, RMD outperforms both SGD, which implicitly regularizes for the $\ell_2$ norm of the weights, and weight decay, which explicitly does so. This makes RMD a viable option for training with regularization in DNNs. In addition, RMD can be used for regularizing the weights to be close to a desired weight vector, which is particularly important for continual learning.
Navid Azizan, Sahin Lale, Babak Hassibi
-
[ Visit Poster at Spot D0 in Virtual World ]

While deep reinforcement learning (RL) methods present an appealing approach to sequential decision-making, such methods are often unstable in practice. What accounts for this instability? Recent theoretical analysis of overparameterized supervised learning with stochastic gradient descent shows that learning is driven by an implicit regularizer once it reaches the zero training error regime, which results in “simpler” functions that generalize. However, in this paper, we show that in the case of deep RL, implicit regularization can instead lead to degenerate features by characterizing the induced implicit regularizer in the semi-gradient TD learning setting in RL with a fixed-dataset (i.e., offline RL). The regularize recapitulates recent empirical findings regarding the rank collapse of learned features and provides an understanding for its cause. To address the adverse impacts of this implicit regularization, we propose a simple and effective explicit regularizer for TD learning, DR3. DR3 minimizes the similarity of learned features of the Q-network at consecutive state-action tuples in the TD update. Empirically, when combined with existing offline RL methods, DR3 substantially improves both performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images.

Aviral Kumar, Rishabh Agarwal, Aaron Courville, Tengyu Ma, George Tucker, Sergey Levine
-
[ Visit Poster at Spot D1 in Virtual World ]

Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that using momentum can lead to faster convergence rate in various settings, it has also been empirically observed that adding momentum yields higher generalization. This paper formally studies how momentum help generalization in deep learning:

Samy Jelassi, Yuanzhi Li