Modern machine learning models are often highly overparameterized. The prime examples are neural network architectures achieving stateoftheart performance, which have many more parameters than training examples. While these models can empirically perform very well, they are not well understood. Worstcase theories of learnability do not explain their behavior. Indeed, overparameterized models sometimes exhibit "benign overfitting", i.e., they have the power to perfectly fit training data (even data modified to have random labels), yet they achieve good performance on the test data. There is evidence that overparameterization may be helpful both computational and statistically, although attempts to use phenomena like double/multiple descent to explain that overparameterization helps to achieve small test error remain controversial. Besides benign overfitting and double/multiple descent, many other interesting phenomena arise due to overparameterization, and many more may have yet to be discovered. Many of these effects depend on the properties of data, but we have only simplistic tools to measure, quantify, and understand data. In light of rapid progress and rapidly shifting understanding, we believe that the time is ripe for a workshop focusing on understanding overparameterization from multiple angles.
Gathertown room1 link: https://eventhosts.gather.town/DbHJbA5ArXpTIoap/icmloppo2021
Gathertown room2 link: https://eventhosts.gather.town/UtqQ3jSJ7wnN0anj/icmloppo2021room2
Sat 9:00 a.m.  9:05 a.m.

Opening Remarks
(opening)
SlidesLive Video » 
🔗 
Sat 9:05 a.m.  9:50 a.m.

Adversarial Examples in Random Deep Networks
(Invited talk)
SlidesLive Video » Because the phenomenon of adversarial examples in deep networks poses a serious barrier to the reliable and robust application of this methodology, there has been considerable interest in why it arises. We consider ReLU networks of constant depth with independent Gaussian parameters, and show that small perturbations of input vectors lead to large changes of outputs. Building on insights of Daniely and Schacham (2020) and of Bubeck et al (2021), we show that adversarial examples arise in these networks because the functions that they compute are very close to linear. The main result is for networks with a constant depth, but we also show that some constraint on depth is necessary for a result of this kind: there are suitably deep networks that, with constant probability, compute a function that is close to constant. This, combined with results characterizing benign overfitting in linear regression, suggests two potential mechanisms behind adversarial examples in overparameterized settings, one arising from label noise and the other from random initialization. Joint work with Sébastien Bubeck and Yeshwanth Cherapanamjeri 
Peter Bartlett 🔗 
Sat 9:50 a.m.  10:00 a.m.

Live Q&A with Peter Bartlett
(Live Q&A)

🔗 
Sat 10:00 a.m.  10:55 a.m.

The PolyakLojasiewicz condition as a framework for overparameterized optimization and its application to deep learning
(Invited talk)
SlidesLive Video » The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradientbased optimization methods applied to large neural networks. In this talk, I will discuss some general mathematical principles allowing for efficient optimization in overparameterized nonlinear systems, a setting that includes deep neural networks. I will discuss that optimization problems corresponding to these systems are not convex, even locally, but instead satisfy the PolyakLojasiewicz (PL) condition on most of the parameter space, allowing for efficient optimization by gradient descent or SGD. I will connect the PL condition of these systems to the condition number associated with the tangent kernel and show how a nonlinear theory for those systems parallels classical analyses of overparameterized linear equations. As a separate related development, I will discuss a perspective on the remarkable recently discovered phenomenon of transition to linearity (constancy of NTK) in certain classes of large neural networks. I will show how this transition to linearity results from the scaling of the Hessian with the size of the network controlled by certain functional norms. Combining these ideas, I will show how the transition to linearity can be used to demonstrate the PL condition and convergence for a general class of wide neural networks. Finally, I will comment on systems that are ''almost'' overparameterized, which appears to be common in practice. Based on joint work with Chaoyue Liu and Libin Zhu 
Mikhail Belkin 🔗 
Sat 10:55 a.m.  11:10 a.m.

Distributional Generalization: A New Kind of Generalization (Extended Abstract)
(Spotlight)
SlidesLive Video » We discover a new set of empirical properties of interpolating classifiers, including neural networks, kernel machines and decision trees. Informally, the output distribution of an interpolating classifier matches the distribution of true labels, when conditioned on certain subgroups of the input space. For example, if we mislabel 30% of images of dogs as cats in the train set of CIFAR10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the test set as well, while leaving other classes unaffected. These behaviors are not captured by classical generalization, which would only consider the average error over the inputs, and not where these errors occur. We introduce and experimentally validate a formal conjecture that specifies the subgroups for which we expect this distributional closeness. Further, we show that these properties can be seen as a new form of generalization, which advances our understanding of the implicit bias of interpolating methods. 
Preetum Nakkiran · Yamini Bansal 🔗 
Sat 11:10 a.m.  11:25 a.m.

Understanding the effect of sparsity on neural networks robustness
(Spotlight)
SlidesLive Video » This paper examines the impact of static sparsity on the robustness of a trained network to weight perturbations, data corruption, and adversarial examples. We show that, up to a certain sparsity achieved by increasing network width and depth while keeping the network capacity fixed, sparsified networks consistently match and often outperform their initially dense versions. Robustness and accuracy decline simultaneously for very high sparsity due to loose connectivity between network layers. Our findings show that a rapid robustness drop caused by network compression observed in the literature is due to a reduced network capacity rather than sparsity. 
Lukas Timpl · Rahim Entezari · Hanie Sedghi · Behnam Neyshabur · Olga Saukh 🔗 
Sat 11:25 a.m.  11:40 a.m.

On the Generalization Improvement from Neural Network Pruning
(Spotlight)
SlidesLive Video » Even though the goal of pruning is often to reduce the computational resources consumed during training or inference, it comes as no surprise to theoreticians or practitioners that pruning also improves generalization. In this work, we empirically study pruning's effect on generalization, focusing on two stateoftheart pruning algorithms: weight rewinding and learningrate rewinding. However, each pruning algorithm is actually an aggregation of many design choices: a weight scoring heuristic, a pruning schedule, a learning rate schedule, among other factors, each of which might contribute to generalization improvement in different ways. We thus ablate each design choice to determine whether it is responsible for pruning's effect on generalization. We find that each individual contribution is limited compared to the generalization improvement achieved with the pruning algorithm in its entirety. Our results also highlight similarities and differences between the effects on generalization caused by pruning and modelwidth scaling. 
Tian Jin · Gintare Karolina Dziugaite · Michael Carbin 🔗 
Sat 12:30 p.m.  1:25 p.m.

Overparametrization: Insights from solvable models
(Invited talk)
SlidesLive Video » This talk gives an overview of recent results in a line of theoretical work that started 3 decades ago in statistical physics. We will first discuss teacherstudent setting of the generalized linear regression. We illustrate the presence of the interpolation peak for classification with ridge loss and its vanishing with regularization. We show that, in the spherical perceptron, the optimally regularized logistic regression approaches very closely the Bayes optimal accuracy. We contrast this with the nonconvex case of phase retrieval where the canonical empirical risk minimization performs poorly compared to the Bayesoptimal error. We then move towards learning with hidden units and analyze double descent in learning with generic fixed features and any convex loss. The formulas we obtain a generic enough to describe the learning of the last layer of neural networks for realistic data and networks. Finally, for the phase retrieval, we are able to analyze gradient descent in the featurelearning regime of a twolayer neural network where we show that overparametrization allows a considerable reduction of the sample complexity. Concretely, an overparametrized neural network only needs twice the input dimension of samples, while nonoverparametrized network needs constant times more, and kernel regression quadratically many samples in the input dimension. 
Lenka Zdeborova 🔗 
Sat 1:25 p.m.  2:20 p.m.

The generalization behavior of random feature and neural tangent models
(Invited talk)
SlidesLive Video » I consider two layer neural networks trained with square loss in the linear (lazy) regime. Under overparametrization, gradient descent converges to the minimum norm interpolant, and I consider this as well as the hole ridge regularization path. From a statistical viewpoint, these approaches are random features models, albeit of a special type. They are also equivalent to kernel ridge regression, with a random kernel of rank N*d (where N is the number of hidden neurons, and d the input dimension). I will describe a precise characterization of the generalization error when N, d and the sample size are polynomially related (and for covariates that are uniform on the ddimensional sphere). I will then discuss the limitation of these approaches. I will explain how sparse random feature models can be learnt efficiently to try to address these limitations. [Based on joint work with Michael Celentano, Song Mei, Theodor Misiakiewicz, Yiqiao Zhong] 
Andrea Montanari 🔗 
Sat 2:20 p.m.  2:35 p.m.

Towards understanding how momentum improves generalization in deep learning
(Spotlight)
SlidesLive Video » Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that using momentum can lead to faster convergence rate in various settings, it has also been empirically observed that adding momentum yields higher generalization. This paper formally studies how momentum help generalization in deep learning: 
Samy Jelassi · Yuanzhi Li 🔗 
Sat 2:35 p.m.  2:50 p.m.

Feature Learning in InfiniteWidth Neural Networks
(Spotlight)
SlidesLive Video » As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g.\ given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g.\ the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinitewidth limits that can \emph{learn} features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the \emph{Tensor Programs} technique, we derive explicit formulas for such limits. On Word2Vec and fewshot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finitewidth networks, with the latter approaching the infinitewidth feature learning performance as width increases. 
Greg Yang · Edward Hu 🔗 
Sat 2:50 p.m.  3:05 p.m.

A Universal Law of Robustness via Isoperimetry
(Spotlight)
SlidesLive Video »
Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry (or a mixture thereof). In the case of twolayer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj.

Sebastien Bubeck · Mark Sellke 🔗 
Sat 3:55 p.m.  4:50 p.m.

Universal Prediction Band, SemiDefinite Programming and Variance Interpolation
(Invited talk)
SlidesLive Video » A frequent criticism from the statistics community to modern machine learning is the lack of rigorous uncertainty quantification. Instead, the machine learning community would argue that conventional uncertainty quantification based on idealized distributional assumptions may be too restrictive for real data. This paper will make progress in resolving the above inference dilemma. We propose a computationally efficient method to construct nonparametric, heteroskedastic prediction bands for uncertainty quantification, with or without any userspecified predictive model. The dataadaptive prediction band is universally applicable with minimal distributional assumptions, with strong nonasymptotic coverage properties, and easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semidefinite programming and sumofsquares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed. 
Tengyuan Liang 🔗 
Sat 4:50 p.m.  5:45 p.m.

Function space view of MultiChannel Linear Convolutional Networks with Bounded Weight Norm
(Invited talk)
SlidesLive Video » The magnitude of the weights of a neural network is a fundamental measure of complexity that plays a crucial role in the study of implicit and explicit regularization. For example, in recent work, gradient descent updates in overparameterized models asymptotically lead to solutions that implicitly minimize the ell2 norm of the parameters of the model, resulting in an inductive bias that is highly architecturedependent. To investigate the properties of learned functions, it is natural to consider a function space view given by the minimum ell2 norm of weights required to realize a given function with a given network. We call this the “induced regularizer” of the network. Building on a line of recent work, we study the induced regularizer of linear convolutional neural networks with a focus on the role of kernel size and the number of channels. We introduce an SDP relaxation of the induced regularizer, that we show is tight for networks with a single input channel. Using this SDP formulation, we show that the induced regularizer is independent of the number of the output channels for singleinput channel networks, and for multiinput channel networks, we show independence given sufficiently many output channels. Moreover, we show that as the kernel size increases, the induced regularizer interpolates between a basisinvariant norm and a basisdependent norm that promotes sparse structures in Fourier space. Based on joint work with Meena Jagadeesan and Ilya Razenshteyn. 
Suriya Gunasekar 🔗 
Sat 5:45 p.m.  6:00 p.m.

ValueBased Deep Reinforcement Learning Requires Explicit Regularization
(Spotlight)
SlidesLive Video » While deep reinforcement learning (RL) methods present an appealing approach to sequential decisionmaking, such methods are often unstable in practice. What accounts for this instability? Recent theoretical analysis of overparameterized supervised learning with stochastic gradient descent shows that learning is driven by an implicit regularizer once it reaches the zero training error regime, which results in “simpler” functions that generalize. However, in this paper, we show that in the case of deep RL, implicit regularization can instead lead to degenerate features by characterizing the induced implicit regularizer in the semigradient TD learning setting in RL with a fixeddataset (i.e., offline RL). The regularize recapitulates recent empirical findings regarding the rank collapse of learned features and provides an understanding for its cause. To address the adverse impacts of this implicit regularization, we propose a simple and effective explicit regularizer for TD learning, DR3. DR3 minimizes the similarity of learned features of the Qnetwork at consecutive stateaction tuples in the TD update. Empirically, when combined with existing offline RL methods, DR3 substantially improves both performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images. 
Aviral Kumar · Rishabh Agarwal · Aaron Courville · Tengyu Ma · George Tucker · Sergey Levine 🔗 
Sat 6:00 p.m.  6:15 p.m.

Beyond Implicit Regularization: Avoiding Overfitting via Regularizer Mirror Descent
(Spotlight)
SlidesLive Video »
It is widely recognized that, despite perfectly interpolating the training data, deep neural networks (DNNs) can still generalize well due in part to the ``implicit regularization'' induced by the learning algorithm. Nonetheless, ``explicit regularization'' (or weight decay) is often used to avoid overfitting, especially when the data is known to be corrupted. There are several challenges with using explicit regularization, most notably unclear convergence properties. In this paper, we propose a novel variant of the stochastic mirror descent (SMD) algorithm, called \emph{regularizer mirror descent (RMD)}, for training DNNs. The starting point for RMD is a cost which is the sum of the training loss and any convex regularizer of the network weights. For highlyoverparameterized models, RMD provably converges to a point ``close'' to the optimal solution of this cost. The algorithm imposes virtually no additional computational burden compared to stochastic gradient descent (SGD) or weight decay, and is parallelizable in the same manner that they are. Our experimental results on training sets that contain some errors suggest that, in terms of generalization performance, RMD outperforms both SGD, which implicitly regularizes for the $\ell_2$ norm of the weights, and weight decay, which explicitly does so. This makes RMD a viable option for training with regularization in DNNs. In addition, RMD can be used for regularizing the weights to be close to a desired weight vector, which is particularly important for continual learning.

Navid Azizan · Sahin Lale · Babak Hassibi 🔗 
Sat 6:15 p.m.  6:20 p.m.

Closing Remarks
(closing)
SlidesLive Video » 
🔗 


Generalization Error and Overparameterization While Learning over Networks
(Poster)
[ Visit Poster at Spot A2 in Virtual World ]
We investigate the performance of distributed learning for largescale linear regression where the model unknowns are distributed over the network. We provide highprobability bounds on the generalization error for this distributed learning setting for isotropic as well as correlated Gaussian regressors. Our investigations show that the generalization error of the distributed solution can grow unboundedly even though the training error is low. We highlight the effect of partitioning of the training data over the network of learners on the generalization error. Our results are particularly interesting for the overparametrized scenario, illustrating fast convergence but also possibly unbounded generalization error. 
Martin Hellkvist · Ayca Ozcelikkale 🔗 


On the interplay between data structure and loss function: an analytical study of generalization for classification
(Poster)
[ Visit Poster at Spot A1 in Virtual World ]
[ Visit Poster at Spot A0 in Virtual World ]
In this work, we present an analytical solution for a simple model of classification problems on structured data. Using methods from statistical physics, we obtain a precise asymptotic expression for the test errors of random feature models trained on a strong and weak features  a model of data with input data covariance built from independent blocks allowing us to tune the saliency of lowdimensional structures and their alignment with respect to the target function. Leveraging our analytical result, we explore how properties of data distributions impact generalization in the overparametrized regime and compare results for the logistic and square loss. Our results show in particular that the logistic loss benefits more robustly from structured data than the squared loss. Numerical experiments on MNIST and CIFAR10 confirm this insight. 
Stéphane d'Ascoli · Marylou Gabrié · Levent Sagun · Giulio Biroli 🔗 


FiniteSample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime
(Poster)
[ Visit Poster at Spot A3 in Virtual World ]
We prove bounds on the population risk of the maximum margin algorithm for twoclass linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the classconditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient overparameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk. 
Niladri Chatterji · Phil Long 🔗 


Some samples are more similar than others! A different look at memorization and generalization in neural networks.
(Poster)
[ Visit Poster at Spot A4 in Virtual World ]
Neural networks can overfit the noisy training data and still generalize well on the unseen data. Current methods focus on training dynamics or varying the noise in the dataset to understand this phenomenon. We propose a method that allows us to compare the similarity between two sets of samples for a given network. Our approach relies on the weights learned by the network and is independent of the training dynamics & properties of the training dataset. Using the proposed method, we investigate three hypotheses empirically: Are real and noisy samples learned at different parts of the network? Do real and noisy samples contribute equally to the generalization? Are real samples more similar to each other than the noisy samples? 
Sudhanshu Ranjan 🔗 


When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?
(Poster)
[ Visit Poster at Spot A5 in Virtual World ]
We establish conditions under which gradient descent applied to fixedwidth deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses. 
Niladri Chatterji · Phil Long · Peter Bartlett 🔗 


On Alignment in Deep Linear Neural Networks
(Poster)
[ Visit Poster at Spot A6 in Virtual World ]
We study the properties of alignment, a form of implicit regularization, in linear neural networks under gradient descent. We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1d outputs as defined in (Ji & Telgarsky, 2018). While in fully connected networks there always exists a global minimum corresponding to an aligned solution, we analyze alignment as it relates to the training process. Namely, we characterize when alignment is an invariant of training under gradient descent by providing necessary and sufficient conditions for this invariant to hold. In such settings, the dynamics of gradient descent simplify, thereby allowing us to provide an explicit learning rate under which the network converges linearly to a global minimum. We conclude by analyzing networks with layer constraints, such as convolutional networks, and prove that alignment is impossible with sufficiently large datasets. 
Adityanarayanan Radhakrishnan · Eshaan Nichani · Daniel Bernstein · Caroline Uhler 🔗 


Increasing Depth Leads to UShaped Test Risk in Overparameterized Convolutional Networks
(Poster)
[ Visit Poster at Spot B0 in Virtual World ]
Recent works have demonstrated that increasing model capacity through width in overparameterized neural networks leads to a decrease in test risk. Model capacity, however, can also be increased through depth, yet understanding the impact of increasing depth on test risk remains an open question. In this work, we demonstrate that the test risk of overparameterized convolutional networks is a Ushaped curve (i.e. monotonically decreasing, then increasing) with increasing depth. We first provide empirical evidence for this phenomenon via image classification experiments using both ResNets and the convolutional neural tangent kernel (CNTK). We then present a novel linear regression framework for characterizing the impact of depth on test risk, and show that increasing depth leads to a Ushaped test risk for the linear CNTK. In particular, we prove that the linear CNTK corresponds to a depthdependent linear transformation on the original space and characterize properties of this transformation. We then analyze overparameterized linear regression under arbitrary linear transformations and, in simplified settings, identify the depths which minimize the bias and variance terms of the test risk. 
Eshaan Nichani · Adityanarayanan Radhakrishnan · Caroline Uhler 🔗 


How does OverParametrization Lead to Acceleration for Learning a Single Teacher Neuron with Quadratic Activation?
(Poster)
[ Visit Poster at Spot B1 in Virtual World ]
Overparametrization has become a popular technique in deep learning. It is observed that by overparametrization, a larger neural network needs a fewer training iterations than a smaller one to achieve a certain level of performance  namely, overparametrization leads to acceleration in optimization. However, despite that overparametrization is widely used nowadays, little theory is available to explain the acceleration due to overparametrization. In this paper, we propose understanding it by studying a simple problem first. Specifically, we consider the setting that there is a single teacher neuron with quadratic activation, where overparametrization is realized by having multiple student neurons learn the data generated from the teacher neuron. We provably show that overparametrization helps the iterate generated by gradient descent to enter the neighborhood of a global optimal solution that achieves zero testing error faster. 
JunKun Wang · Jacob Abernethy 🔗 


Empirical Study on the Effective VC Dimension of Lowrank Neural Networks
(Poster)
[ Visit Poster at Spot B2 in Virtual World ]
A recent study on overparameterized neural networks by Huh et al. (2021) finds that gradientbased training of overparameterized neural networks finds lowrank parameters, implying that certain implicit lowrank constraints are in action. Inspired by this, we empirically study the effective VC dimension of neural networks with lowrank parameters. Our finding is that their effective VC dimension is proportional to a specific weighted sum of perlayer parameter counts, which we call the effective number of parameters. As the effective VC dimension lower bounds the VC dimension, our result suggests a possibility that the analytic VC dimension upper bound proposed by Bartlett et al. (2019) is indeed tight for neural networks with lowrank parameters. 
Daewon Seo · Hongyi Wang · Dimitris Papailiopoulos · Kangwook Lee 🔗 


Benign Overfitting in Adversarially Robust Linear Classification
(Poster)
[ Visit Poster at Spot B3 in Virtual World ]
``Benign overfitting'', where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works have provided theoretical justification in overparameterized linear regression and classification, and kernel methods. However, it is not clear if benign overfitting still occurs in the presence of adversarial examples, which are examples with tiny and intentional perturbations to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defending against adversarial examples. In detail, we prove the risk bounds of the adversarially trained linear classifier on the mixture of subGaussian data under $\ell_p$ adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve the nearoptimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings.

Jinghui Chen · Yuan Cao · Yuan Cao · Quanquan Gu 🔗 


Mitigating deep double descent by concatenating inputs
(Poster)
[ Visit Poster at Spot B4 in Virtual World ]
The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical biasvariance curve with the behavior of modern neural networks, occurring where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomena and the number of samples in the deep neural network setting. In particular, we propose a construction which augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent, and observe a smooth descent into the overparameterized region for our construction. This occurs both with respect to the model size, and with respect to the number epochs. 
John Chen · Qihan Wang · Anastasios Kyrillidis 🔗 


Robust Generalization of Quadratic Neural Networks via Function Identification
(Poster)
[ Visit Poster at Spot B5 in Virtual World ]
A key challenge facing deep learning is that neural networks are often not robust to shifts in the underlying data distribution. We study this problem from the perspective of the statistical concept of parameter identification. Generalization bounds from learning theory often assume that the test distribution is close to the training distribution. In contrast, if we can identify the ``true'' parameters, then the model generalizes to arbitrary distribution shifts. However, neural networks are typically overparameterized, making parameter identification impossible. We show that for quadratic neural networks, we can identify the function represented by the model even though we cannot identify its parameters. Thus, we can obtain robust generalization bounds even in the overparameterized setting. We leverage this result to obtain new bounds for contextual bandits and transfer learning with quadratic neural networks. Overall, our results suggest that we can improve robustness of neural networks by designing models that can represent the true data generating process. In practice, the true data generating process is often very complex; thus, we study how our framework might connect to neural module networks, which are designed to break down complex tasks into compositions of simpler ones. We prove robust generalization bounds when individual neural modules are identifiable. 
Kan Xu · Hamsa Bastani · Osbert Bastani 🔗 


Label Noise SGD Provably Prefers Flat Global Minimizers
(Poster)
[ Visit Poster at Spot B6 in Virtual World ]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) +\lambda R(\theta)$, where $L(\theta)$ is the training loss, $\lambda$ is an effective regularization parameter, and $R(\theta)$ is an explicit regularizer that penalizes sharp minimizers. Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones. We also prove extensions to classification with general loss functions, SGD with momentum, and SGD with general noise covariance, significantly strengthening the prior work of Blanc et al. to global convergence and large learning rates and of HaoChen et al. to general models.

Alex Damian · Tengyu Ma · Jason Lee 🔗 


On the Origins of the Block Structure Phenomenon in Neural Network Representations
(Poster)
[ Visit Poster at Spot C0 in Virtual World ]
Increasing capacity of neural network architectures by varying their width and depth has been central to their successes. However, recent work has shown that in overparameterized models, the hidden representations exhibit a block structure  a large set of contiguous layers with highly similar representations. In this paper, we investigate how this block structure arises, its connection to the data, and the relationship between training mechanisms and the block structure. We begin by showing that the block structure representations are robust to small outofdistribution shifts in the data. Leveraging insights connecting the block structure and the first principal component of the representations, we then demonstrate that the block structure arises from a small group of examples with similar image statistics. These examples have very large activation norms, and dominate the representational geometry of intermediate network layers. While these "dominant" datapoints are similar across all layers inside the block structure of a single network, different training runs lead to different sets of dominant datapoints. With these insights, we take an interventional approach, introducing a method to regularize the block structure, and also exploring how popular training mechanisms that help with performance can eliminate the block structure in the internal representations of overparameterized models. 
Thao Nguyen · Maithra Raghu · Simon Kornblith 🔗 


Structured Model Pruning of Convolutional Networks on Tensor Processing Units
(Poster)
[ Visit Poster at Spot C1 in Virtual World ]
The deployment of convolutional neural networks is often hindered by high computational and storage requirements. Structured model pruning is a promising approach to alleviate these requirements. Using the VGG16 model as an example, we measure the accuracyefficiency tradeoff for various structured model pruning methods and datasets (CIFAR10 and ImageNet) on Tensor Processing Units (TPUs). To measure the actual performance of models, we develop a structured model pruning library for TensorFlow2 to modify models in place (instead of adding mask layers). We show that structured model pruning can significantly improve model memory usage and speed on TPUs without losing accuracy, especially for small datasets (e.g., CIFAR10). 
Kongtao Chen 🔗 


Benign Overfitting in Multiclass Classification: All Roads Lead to Interpolation
(Poster)
[ Visit Poster at Spot C2 in Virtual World ]
Motivated by the growing literature on ``benign overfitting" in overparameterized models, we study benign overfitting in multiclass linear classification. Specifically, we consider the following popular training algorithms on separable data generated from Gaussian mixtures: (i) empirical risk minimization (ERM) with crossentropy loss, which converges to the multiclass support vector machine (SVM) solution; (ii) ERM with leastsquares loss, which converges to the minnorm interpolating (MNI) solution; and, (iii) the onevsall SVM classifier. Our first key finding is that under a simple sufficient condition, all three algorithms lead to classifiers that interpolate the training data and have equal accuracy. Second, we derive novel error bounds on the accuracy of the MNI classifier, thereby showing that all three training algorithms lead to benign overfitting under sufficient overparameterization. Ultimately, our analysis shows that good generalization is possible for SVM solutions beyond the realm in which typical marginbased bounds apply. 
Ke Wang · Vidya Muthukumar · Christos Thrampoulidis 🔗 


Inductive Bias of MultiChannel Linear Convolutional Networks with Bounded Weight Norm
(Poster)
[ Visit Poster at Spot C3 in Virtual World ]
We study the function space characterization of the inductive bias resulting from controlling the $\ell_2$ norm of the weights in linear convolutional networks. We view this in terms of an *induced regularizer* in the function space given by the minimum norm of weights required to realize a linear function. For two layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network have a single channel, the induced regularizer for any $K$ is a norm given by a semidefinite program (SDP) that is *independent* of the number of output channels $C$. (b) In contrast, for networks with multichannel inputs, multiple output channels can be necessary to merely realize all matrixvalued linear functions and thus the inductive bias \emph{does} depend on $C$. Further, for sufficiently large $C$, the induced regularizer for $K=1$ and $K=D$ are the nuclear norm and the $\ell_{2,1}$ groupsparse norm, respectively, of the Fourier coefficients. (c) Complementing our theoretical results, we show through experiments on MNIST and CIFAR10 that our key findings extend to implicit biases from gradient descent in overparameterized networks.

Meena Jagadeesan · Ilya Razenshteyn · Suriya Gunasekar 🔗 


Sample Complexity and Overparameterization Bounds for Temporal Difference Learning with Neural Network Approximation
(Poster)
[ Visit Poster at Spot C4 in Virtual World ]
We study the dynamics of temporal difference learning with neural networkbased value function approximation over a general state space, namely, \emph{Neural TD learning}. We consider two practically used algorithms, \emph{projectionfree} and \emph{maxnorm regularized} Neural TD learning, and establish the first convergence bounds for these algorithms. An interesting observation from our results is that maxnorm regularization can dramatically improve the performance of TD learning algorithms, both in terms of sample complexity and overparameterization. In particular, we prove that maxnorm regularization achieves stateoftheart sample complexity and overparameterization bounds by exploiting the geometry of the neural tangent kernel (NTK) function class. The results in this work rely on a novel Lyapunov drift analysis of the network parameters as a stopped and controlled random process. 
Semih Cayci · Siddhartha Satpathi · Niao He · R Srikant 🔗 


Double Descent in Feature Selection: Revisiting LASSO and Basis Pursuit
(Poster)
[ Visit Poster at Spot C5 in Virtual World ]
We present a novel analysis of feature selection in linear models by the convex framework of least absolute shrinkage operator (LASSO) and basis pursuit (BP). Our analysis pertains to a general overparametrized scenario. When the numbers of the features and the data samples grow proportionally, we obtain precise expressions for the asymptotic generalization error of LASSO and BP. Considering a mixture of strong and weak features, we provide insights into regularization tradeoffs for double descent for $\ell_1$ norm minimization. We validate these results with numerical experiments.

David Bosch · Ashkan Panahi · Ayca Ozcelikkale 🔗 


On Low Rank Training of Deep Neural Networks
(Poster)
[ Visit Poster at Spot C6 in Virtual World ]
Training deep neural networks in low rank, i.e. with factorised layers, is of particular interest to the community: it offers efficiency over unfactorised training in terms of both memory consumption and training time. Prior work has focused on low rank approximations of pretrained networks and training in low rank space with additional objectives, offering various ad hoc explanations for chosen practice. We analyse techniques that work well in practice, and through extensive ablations on models such as GPT2 we provide evidence falsifying common beliefs in the field, hinting in the process at exciting research opportunities that still need answering. 
Siddhartha Kamalakara · Acyr Locatelli · Bharat Venkitesh · Jimmy Ba · Yarin Gal · Aidan Gomez 🔗 


On the Sparsity of Deep Neural Networks in the Overparameterized Regime: An Empirical Study
(Poster)
[ Visit Poster at Spot D0 in Virtual World ]
Sparsity and lowrank structures have been incorporated into neural networks to reduce computational complexity and to improve generalization and robustness. Recent theoretical developments show that both are natural characteristics of datafitting solutions cast in a new family of Banach spaces referred to as RBV2 spaces, the spaces of secondorder bounded variation in the Radon domain. Moreover, sparse and deep ReLU networks are solutions to infinite dimensional variational problems in compositions of these spaces. This means that these learning problems can be recast as parametric optimizations over neural network weights. Remarkably, standard weight decay and variants correspond exactly to regularizing the RBV2norm in the function space. Empirical validation in this paper confirm that weight decay leads to sparse and lowrank networks, as predicted by the theory. 
Rahul Parhi · Jack Wolf · Robert Nowak 🔗 


Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks
(Poster)
[ Visit Poster at Spot D1 in Virtual World ]
We analyze the learning dynamics of infinitely wide neural networks with a finite sized bottleneck. Unlike the neural tangent kernel limit, a bottleneck in an otherwise infinite width network allows data dependent feature learning in its bottleneck representation. We empirically show that a single bottleneck in infinite networks dramatically accelerates training when compared to purely infinite networks, with an improved overall performance. We discuss the acceleration phenomena by drawing similarities to infinitely wide deep linear models, where the acceleration effect of a bottleneck can be understood theoretically. 
Etai Littwin · Omid Saremi · Shuangfei Zhai · Vimal Thilak · Hanlin Goh · Joshua M Susskind · Greg Yang 🔗 


Classification and Adversarial Examples in an Overparameterized Linear Model: A SignalProcessing Perspective
(Poster)
[ Visit Poster at Spot D2 in Virtual World ]
In practice, classification models that generalize well are often susceptible to adversarial perturbations. We illustrate a novel estimationcentric explanation of adversarial susceptibility using an overparameterized linear model on lifted Fourier features. We show that the min 2norm interpolator of training data can be susceptible even to adversaries who can only perturb the lowdimensional inputs and not the highdimensional lifted features directly. The adversarial vulnerability arises because of a phenomena we term spatial localization: the predictions of the learned model are markedly more sensitive in the vicinity of training points than elsewhere. This sensitivity is crucially a consequence of feature lifting and can have consequences reminiscent of Gibb’s and Runge’s phenomena from signal processing and functional analysis. Despite the adversarial susceptibility, we find that classification using spatially localized features can be “easier” i.e. less sensitive to the strength of the prior than in independent feature setups. Our findings are replicated theoretically for a randomfeature setup that exhibits doubledescent behavior, and empirically for polynomial features. 
Adhyyan Narang · Vidya Muthukumar · Anant Sahai 🔗 


Gradient Starvation: A Learning Proclivity in Neural Networks
(Poster)
[ Visit Poster at Spot A1 in Virtual World ]
We identify and formalize a fundamental gradient descent phenomenon leading to a learning proclivity in overparameterized neural networks. Gradient Starvation arises when crossentropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalances in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel but simple regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation. We illustrate our findings with simple and realworld outofdistribution (OOD) generalization experiments. 
Mohammad Pezeshki · SékouOumar Kaba · Yoshua Bengio · Aaron Courville · Doina Precup · Guillaume Lajoie 🔗 


Studying the Consistency and Composability of Lottery Ticket Pruning Masks
(Poster)
[ Visit Poster at Spot A2 in Virtual World ]
Magnitude pruning is a common, effective technique to identify sparse subnetworks at little cost to accuracy. In this work, we ask whether a particular architecture's accuracysparsity tradeoff can be improved by combining pruning information across multiple runs of training. From a shared ResNet20 initialization, we train several network copies (\textit{siblings}) to completion using different SGD data orders on CIFAR10. While the siblings' pruning masks are naively not much more similar than chance, starting sibling training after a few epochs of shared pretraining significantly increases pruning overlap. We then choose a subnetwork by either (1) taking all weights that survive pruning in any sibling (mask union), or (2) taking only the weights that survive pruning across all siblings (mask intersection). The resulting subnetwork is retrained. Strikingly, we find that union and intersection masks perform very similarly. Both methods match the accuracysparsity tradeoffs of the oneshot magnitude pruning baseline, even when we combine masks from up to $k = 10$ siblings.

Rajiv Movva · Michael Carbin · Jonathan Frankle 🔗 


EpochWise Double Descent: A Theory of Multiscale Feature Learning Dynamics
(Poster)
[ Visit Poster at Spot A3 in Virtual World ]
A key challenge in building theoretical foundations for deep learning is their complex optimization dynamics. Such dynamics result from highdimensional interactions between parameters, leading to nontrivial behaviors. In this regard, a particularly puzzling phenomenon is the ``double descent'' of the generalization error with increasing model complexity (modelwise) or training time (epochwise). While modelwise double descent has been a subject of extensive study of recent, the origins of the latter are much less clear. To bridge this gap, in this work, we leverage tools from statistical physics to study a simple teacherstudent setup exhibiting epochwise double descent similar to deep neural networks. In this setting, we derive closedform analytical expressions for the evolution of generalization error as a function of the training time. Crucially, this provides a new mechanistic explanation of epochwise double descent, suggesting that it can be attributed to features being learned at different time scales. Summarily, while a fastlearning feature is overfitted, a slowerlearning feature starts to fit, resulting in a nonmonotonous generalization curve. 
Mohammad Pezeshki · Amartya Mitra · Yoshua Bengio · Guillaume Lajoie 🔗 


Implicit Greedy Rank Learning in Autoencoders via Overparameterized Linear Networks
(Poster)
[ Visit Poster at Spot A4 in Virtual World ]
Deep linear networks trained with gradient descent yield low rank solutions, as is typically studied in matrix factorization. In this paper, we take a step further and analyze implicit rank regularization in autoencoders. We show greedy learning of lowrank latent codes induced by a linear subnetwork at the autoencoder bottleneck. We further propose orthogonal initialization and principled learning rate adjustment to mitigate sensitivity of training dynamics to spectral prior and linear depth. With linear autoencoders on synthetic data, our method converges stably to groundtruth latent code rank. With nonlinear autoencoders, our method converges to latent ranks optimal for downstream classification and image sampling. 
ShihYu Sun · Vimal Thilak · Etai Littwin · Omid Saremi · Joshua M Susskind 🔗 


Assessing Generalization of SGD via Disagreement Rates
(Poster)
[ Visit Poster at Spot A5 in Virtual World ]
We empirically show that the test error of deep networks can be estimated simply by training the same architecture on the same training set but with a different run of SGD, and measuring the disagreement rate between the two networks on unlabeled test data. This builds on  and is a stronger version of  the observation in Nakirran & Bansal (20), which requires the second run to be on an altogether fresh training set. We further theoretically show that this peculiar phenomenon arises from the wellcalibrated nature of ensembles of SGDtrained models. This finding not only provides a simple empirical measure to directly predict the test error using unlabeled test data, but also establishes a new conceptual connection between generalization and calibration. 
YiDing Jiang · Vaishnavh Nagarajan · Zico Kolter 🔗 


Risk Bounds for Overparameterized Maximum Margin Classification on SubGaussian Mixtures
(Poster)
[ Visit Poster at Spot A6 in Virtual World ]
Modern machine learning systems such as deep neural networks are often highly overparameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this ``benign overfitting'' phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from subGaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the overparameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for overparameterized logistic regression. 
Yuan Cao · Yuan Cao · Quanquan Gu · Mikhail Belkin 🔗 


Rethinking compactness in deep neural networks
(Poster)
[ Visit Poster at Spot B0 in Virtual World ]
Deep neural networks are a type of overparameterized models which are able to achieve high performance despite having typically more parameters than training samples. Recently, there has been an increasing interest in uncovering and understanding the different phenomena that occur in the overparameterized regime induced by such neural networks. In this paper, we aim to shed light on the relationship between class compactness of the learned feature representations and the model performance. Surprisingly, we find that models that learn more classinvariant features do not necessarily perform better. Moreover, we show that during training, classwise variance increases and the models learn a less compact and more outspread representation of the classes. 
Kateryna Chumachenko · Firas Laakom · Jenni Raitoharju · Alexandros Iosifidis · Moncef Gabbouj 🔗 


Overfitting of Polynomial Regression with Overparameterization
(Poster)
[ Visit Poster at Spot B1 in Virtual World ]
We apply the theory introduced in \cite{jacot2020kernel} to study the risk (i.e. generalization error) in a polynomial regression setup where the data points $ x_i $'s are i.i.d. samples from the uniform distribution on $[1,1]$ and the observations $ y_i = f^*(x_i) + \epsilon e_i $ are generated by a continuous true function $ f^* $ where $ e_i $'s are standard Gaussian noise and $ \epsilon $ is the noise level. In our setup, we can precisely compute the Signal Capture Threshold (SCT) as a function of the number of polynomial features $ P+1 $, the number of samples $ N $, and the ridge $ \lambda > 0 $ which enables a precise analysis of the risk and an explanation of the overfitting of polynomial features with overparameterization.

Hugo Fabregues · Berfin Simsek 🔗 


On the memorization properties of contrastive learning
(Poster)
[ Visit Poster at Spot B2 in Virtual World ]
Memorization studies of deep neural networks (DNNs) help to understand what patterns and how do DNNs learn, and motivate improvements to DNN training approaches. In this work, we investigate the memorization properties of SimCLR, a widely used contrastive selfsupervised learning approach, and compare them to the memorization of supervised learning and random labels training. We find that both training objects and augmentations may have different complexity in the sense of how SimCLR learns them. Moreover, we show that SimCLR is similar to random labels training in terms of the distribution of training objects complexity. 
Ildus Sadrtdinov · Nadezhda Chirkova · Ekaterina Lobacheva 🔗 


OverParameterization and Generalization in Audio Classification
(Poster)
[ Visit Poster at Spot B3 in Virtual World ]
Convolutional Neural Networks (CNNs) have been dominating classification tasks in various domains, such as machine vision, machine listening, and natural language processing. In machine listening, while generally exhibiting very good generalization capabilities, CNNs are sensitive to the specific audio recording device used, which has been recognized as a substantial problem in the acoustic scene classification (DCASE) community. In this study, we investigate the relationship between overparameterization of acoustic scene classification models, and their resulting generalization abilities. Specifically, we test scaling CNNs in width and depth, under different conditions. Our results indicate that increasing width improves generalization to unseen devices, even without an increase in the number of parameters. 
Khaled Koutini · Khaled Koutini · Hamid Eghbalzadeh · Florian Henkel · Jan Schlüter · Gerhard Widmer 🔗 


Surprising benefits of ridge regularization for noiseless regression
(Poster)
[ Visit Poster at Spot B4 in Virtual World ]
Numerous recent works show that overparameterization implicitly reduces variance for minimumnorm interpolators, suggesting vanishing benefits for ridge regularization in high dimensions. However, empirical findings suggest that this narrative may not hold true for robust generalization. In this paper we reveal that for overparameterized linear regression, the robust risk is minimized for a positive regularization coefficient even when the training data is noiseless. Hence, we effectively provide, to the best of our knowledge, the first theoretical analysis on the phenomenon of robust overfitting. 
Konstantin Donhauser · Alexandru Tifrea · Michael Aerni · Reinhard Heckel · Fanny Yang 🔗 


Binary Classification of Gaussian Mixtures: Abundance of Support Vectors, Benign Overfitting and Regularization
(Poster)
[ Visit Poster at Spot B5 in Virtual World ]
Deep neural networks generalize well despite being exceedingly overparameterized and being trained without explicit regularization. This curious phenomenon, often termed benign overfitting, has inspired extensive research activity in establishing its statistical principles. In this work, we study both maxmargin SVM and minnorm interpolating classifiers. First, we leverage an idea introduced in [V. Muthukumar et al., arXiv:2005.08054, (2020)] to relate the SVM solution to the leastsquares (LS) interpolating solution. Second, we derive nonasymptotic bounds on the classification error of the LS solution. Combining the two, we present sufficient conditions on the overparameterization ratio and on the signaltonoise ratio (SNR) for benign overfitting to occur. Moreover, we investigate the role of regularization and identify precise conditions under which the interpolating estimator performs better than the regularized estimates. We corroborate our theoretical findings with numerical simulations. 
Ke Wang · Christos Thrampoulidis 🔗 


LabelImbalanced and GroupSensitive Classification under Overparameterization
(Poster)
[ Visit Poster at Spot B6 in Virtual World ]
The goal in labelimbalanced and groupsensitive classification is to optimize metrics such as balanced error and equal opportunity. Classical methods like reweighted crossentropy, are known to fail when used with the modern practice of training deep nets to the terminal phase of training (TPT), that is training beyond zero training error. In contrast to previous heuristics, we follow a principled analysis explaining how different loss adjustments affect margins. First, we prove that for linear classifiers trained in TPT, it is necessary to introduce multiplicative, rather than additive, logit adjustments so that the relative margins between classes change appropriately. To show this, we discover a connection of the multiplicative CE modification to the costsensitive supportvector machines. While additive adjustments are ineffective deep in the TPT, we show numerically that they can speed up convergence by countering an initial negative effect of the multiplicative weights. Motivated by these findings, we formulate the vectorscaling (VS) loss, that captures existing techniques as special cases. For Gaussianmixtures data, we perform a generalization analysis, revealing tradeoffs between balanced / standard error and equal opportunity. 
Ganesh Ramachandra Kini · Orestis Paraskevas · Samet Oymak · Christos Thrampoulidis 🔗 


Earlystopped neural networks are consistent
(Poster)
[ Visit Poster at Spot C0 in Virtual World ]
This work studies the behavior of neural networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal in terms of not just logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true underlying conditional distribution arbitrarily finely. Moreover, the necessary iteration, sample, and architectural complexities of this analysis all scale naturally with a certain complexity measure of the true conditional model. Lastly, while it is not shown that early stopping is necessary, it is shown that any univariate classifier satisfying a local interpolation property is necessarily inconsistent. 
Ziwei Ji · Matus Telgarsky 🔗 


Distributional Generalization: A New Kind of Generalization (Extended Abstract)
(Poster)
[ Visit Poster at Spot C1 in Virtual World ]
We discover a new set of empirical properties of interpolating classifiers, including neural networks, kernel machines and decision trees. Informally, the output distribution of an interpolating classifier matches the distribution of true labels, when conditioned on certain subgroups of the input space. For example, if we mislabel 30% of images of dogs as cats in the train set of CIFAR10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the test set as well, while leaving other classes unaffected. These behaviors are not captured by classical generalization, which would only consider the average error over the inputs, and not where these errors occur. We introduce and experimentally validate a formal conjecture that specifies the subgroups for which we expect this distributional closeness. Further, we show that these properties can be seen as a new form of generalization, which advances our understanding of the implicit bias of interpolating methods. 
Preetum Nakkiran · Yamini Bansal 🔗 


Feature Learning in InfiniteWidth Neural Networks
(Poster)
[ Visit Poster at Spot C2 in Virtual World ]
As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g.\ given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g.\ the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinitewidth limits that can \emph{learn} features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the \emph{Tensor Programs} technique, we derive explicit formulas for such limits. On Word2Vec and fewshot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finitewidth networks, with the latter approaching the infinitewidth feature learning performance as width increases. 
Greg Yang · Edward Hu 🔗 


On the Generalization Improvement from Neural Network Pruning
(Poster)
[ Visit Poster at Spot C3 in Virtual World ]
Even though the goal of pruning is often to reduce the computational resources consumed during training or inference, it comes as no surprise to theoreticians or practitioners that pruning also improves generalization. In this work, we empirically study pruning's effect on generalization, focusing on two stateoftheart pruning algorithms: weight rewinding and learningrate rewinding. However, each pruning algorithm is actually an aggregation of many design choices: a weight scoring heuristic, a pruning schedule, a learning rate schedule, among other factors, each of which might contribute to generalization improvement in different ways. We thus ablate each design choice to determine whether it is responsible for pruning's effect on generalization. We find that each individual contribution is limited compared to the generalization improvement achieved with the pruning algorithm in its entirety. Our results also highlight similarities and differences between the effects on generalization caused by pruning and modelwidth scaling. 
Tian Jin · Gintare Karolina Dziugaite · Michael Carbin 🔗 


A Universal Law of Robustness via Isoperimetry
(Poster)
[ Visit Poster at Spot C4 in Virtual World ]
Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry (or a mixture thereof). In the case of twolayer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj.

Sebastien Bubeck · Mark Sellke 🔗 


Understanding the effect of sparsity on neural networks robustness
(Poster)
[ Visit Poster at Spot C5 in Virtual World ]
This paper examines the impact of static sparsity on the robustness of a trained network to weight perturbations, data corruption, and adversarial examples. We show that, up to a certain sparsity achieved by increasing network width and depth while keeping the network capacity fixed, sparsified networks consistently match and often outperform their initially dense versions. Robustness and accuracy decline simultaneously for very high sparsity due to loose connectivity between network layers. Our findings show that a rapid robustness drop caused by network compression observed in the literature is due to a reduced network capacity rather than sparsity. 
Lukas Timpl · Rahim Entezari · Hanie Sedghi · Behnam Neyshabur · Olga Saukh 🔗 


Beyond Implicit Regularization: Avoiding Overfitting via Regularizer Mirror Descent
(Poster)
[ Visit Poster at Spot C6 in Virtual World ]
It is widely recognized that, despite perfectly interpolating the training data, deep neural networks (DNNs) can still generalize well due in part to the ``implicit regularization'' induced by the learning algorithm. Nonetheless, ``explicit regularization'' (or weight decay) is often used to avoid overfitting, especially when the data is known to be corrupted. There are several challenges with using explicit regularization, most notably unclear convergence properties. In this paper, we propose a novel variant of the stochastic mirror descent (SMD) algorithm, called \emph{regularizer mirror descent (RMD)}, for training DNNs. The starting point for RMD is a cost which is the sum of the training loss and any convex regularizer of the network weights. For highlyoverparameterized models, RMD provably converges to a point ``close'' to the optimal solution of this cost. The algorithm imposes virtually no additional computational burden compared to stochastic gradient descent (SGD) or weight decay, and is parallelizable in the same manner that they are. Our experimental results on training sets that contain some errors suggest that, in terms of generalization performance, RMD outperforms both SGD, which implicitly regularizes for the $\ell_2$ norm of the weights, and weight decay, which explicitly does so. This makes RMD a viable option for training with regularization in DNNs. In addition, RMD can be used for regularizing the weights to be close to a desired weight vector, which is particularly important for continual learning.

Navid Azizan · Sahin Lale · Babak Hassibi 🔗 


ValueBased Deep Reinforcement Learning Requires Explicit Regularization
(Poster)
[ Visit Poster at Spot D0 in Virtual World ]
While deep reinforcement learning (RL) methods present an appealing approach to sequential decisionmaking, such methods are often unstable in practice. What accounts for this instability? Recent theoretical analysis of overparameterized supervised learning with stochastic gradient descent shows that learning is driven by an implicit regularizer once it reaches the zero training error regime, which results in “simpler” functions that generalize. However, in this paper, we show that in the case of deep RL, implicit regularization can instead lead to degenerate features by characterizing the induced implicit regularizer in the semigradient TD learning setting in RL with a fixeddataset (i.e., offline RL). The regularize recapitulates recent empirical findings regarding the rank collapse of learned features and provides an understanding for its cause. To address the adverse impacts of this implicit regularization, we propose a simple and effective explicit regularizer for TD learning, DR3. DR3 minimizes the similarity of learned features of the Qnetwork at consecutive stateaction tuples in the TD update. Empirically, when combined with existing offline RL methods, DR3 substantially improves both performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images. 
Aviral Kumar · Rishabh Agarwal · Aaron Courville · Tengyu Ma · George Tucker · Sergey Levine 🔗 


Towards understanding how momentum improves generalization in deep learning
(Poster)
[ Visit Poster at Spot D1 in Virtual World ]
Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that using momentum can lead to faster convergence rate in various settings, it has also been empirically observed that adding momentum yields higher generalization. This paper formally studies how momentum help generalization in deep learning: 
Samy Jelassi · Yuanzhi Li 🔗 