

Session

Deep Learning Theory


Thu 13 June 11:00 - 11:20 PDT

Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem

Alon Brutzkus · Amir Globerson

Empirical evidence suggests that neural networks with ReLU activations generalize better with over-parameterization. However, there is currently no theoretical analysis that explains this observation. In this work, we provide theoretical and empirical evidence that, in certain cases, overparameterized convolutional networks generalize better than small networks because of an interplay between weight clustering and feature exploration at initialization. We demonstrate this theoretically for a 3-layer convolutional neural network with max-pooling, in a novel setting which extends the XOR problem. We show that this interplay implies that with overparameterization, gradient descent converges to global minima with better generalization performance than the global minima of small networks. Empirically, we demonstrate these phenomena for a 3-layer convolutional neural network on the MNIST task.
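
The paper's setting (a 3-layer convolutional network with max-pooling on an extension of XOR) is not reproduced here, but the "feature exploration at initialization" intuition can be sketched on the classic 2-D XOR problem with a plain two-layer ReLU network: wider networks are more likely to start with neurons covering all the directions XOR needs, so gradient descent fits the data more reliably. A minimal NumPy sketch, with widths, step counts, and learning-rate scaling chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Classic 2-D XOR data; the paper's setting (a 3-layer convolutional network
# with max-pooling on an extension of XOR) is richer than this toy problem.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])

def train_two_layer_relu(width, steps=3000):
    """Full-batch gradient descent on f(x) = a . relu(W x + b) with squared loss."""
    W = rng.normal(size=(width, 2))
    b = rng.normal(size=width)
    a = rng.normal(size=width) / np.sqrt(width)
    lr = 0.5 / width                          # scale the step size with the width for stability
    for _ in range(steps):
        pre = X @ W.T + b                     # (4, width) pre-activations
        h = np.maximum(pre, 0.0)              # ReLU features
        err = h @ a - y                       # residual of the squared loss
        grad_pre = np.outer(err, a) * (pre > 0) / len(X)
        a -= lr * (h.T @ err) / len(X)
        W -= lr * (grad_pre.T @ X)
        b -= lr * grad_pre.sum(axis=0)
    preds = np.sign(np.maximum(X @ W.T + b, 0.0) @ a)
    return float(np.mean(preds == y))

# Wider networks are more likely to start with neurons covering every direction
# XOR needs, so plain gradient descent fits the data more reliably.
for width in (2, 4, 64):
    accs = [train_two_layer_relu(width) for _ in range(20)]
    print(f"width={width:3d}  mean training accuracy over 20 runs: {np.mean(accs):.2f}")
```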

Thu 13 June 11:20 - 11:25 PDT

On the Spectral Bias of Neural Networks

Nasim Rahaman · Aristide Baratin · Devansh Arpit · Felix Draxler · Min Lin · Fred Hamprecht · Yoshua Bengio · Aaron Courville

Neural networks are known to be a class of highly expressive functions able to fit even random input-output mappings with 100% accuracy. In this work we present properties of neural networks that complement this aspect of expressivity. By using tools from Fourier analysis, we highlight a learning bias of deep networks towards low frequency functions -- i.e. functions that vary globally without local fluctuations -- which manifests itself as a frequency-dependent learning speed. Intuitively, this property is in line with the observation that over-parameterized networks prioritize learning simple patterns that generalize across data samples. We also investigate the role of the shape of the data manifold by presenting empirical and theoretical evidence that, somewhat counter-intuitively, learning higher frequencies gets easier with increasing manifold complexity.
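
A minimal way to observe the frequency-dependent learning speed is to train on a 1-D target that mixes a low and a high frequency and track how much of each frequency remains in the residual over training. The sketch below is not the paper's experimental setup: it uses random ReLU features with a trained linear readout, so the gradient-descent dynamics are linear and a stable step size is easy to choose, and all widths, scales, and checkpoints are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D target mixing a low (k=1) and a high (k=16) frequency component.
n = 256
x = np.linspace(0.0, 1.0, n, endpoint=False)
y = np.sin(2 * np.pi * x) + np.sin(2 * np.pi * 16 * x)

# Random ReLU features with a frozen first layer; only the linear readout is
# trained, so the gradient-descent dynamics are linear and easy to keep stable.
width = 1024
W = rng.normal(scale=20.0, size=width)             # frozen input weights
b = rng.uniform(-20.0, 0.0, size=width)            # biases spread the ReLU kinks over [0, 1]
H = np.maximum(np.outer(x, W) + b[None, :], 0.0)   # (n, width) feature matrix

lr = 1.0 / np.linalg.eigvalsh(H @ H.T).max()       # safely inside the stable step-size range
a = np.zeros(width)                                # trainable linear readout

for step in range(1, 20001):
    err = H @ a - y
    a -= lr * (H.T @ err)
    if step in (100, 1000, 5000, 20000):
        # Amplitude of the remaining error at the two target frequencies.
        spec = np.abs(np.fft.rfft(err)) * 2 / n
        print(f"step {step:6d}  residual amplitude at k=1: {spec[1]:.3f}   at k=16: {spec[16]:.3f}")
```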

Thu 13 June 11:25 - 11:30 PDT

Recursive Sketches for Modular Deep Learning

Badih Ghazi · Rina Panigrahy · Joshua R. Wang

We present a mechanism to compute a sketch (succinct summary) of how a complex modular deep network processes its inputs. The sketch summarizes essential information about the inputs and outputs of the network and can be used to quickly identify key components and summary statistics of the inputs. Furthermore, the sketch is recursive and can be unrolled to identify sub-components of these components and so forth, capturing a potentially complicated DAG structure. These sketches erase gracefully; even if we erase a fraction of the sketch at random, the remainder still retains the "high-weight" information present in the original sketch. The sketches can also be organized in a repository to implicitly form a "knowledge graph"; it is possible to quickly retrieve sketches in the repository that are related to a sketch of interest. Arranged in this fashion, the sketches can also be used to learn emerging concepts by looking for new clusters in sketch space. Finally, in the scenario where we want to learn a ground truth deep network, we show that augmenting input/output pairs with these sketches can theoretically make it easier to do so.
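
The paper's recursive sketching mechanism is more elaborate than what fits here, but the basic flavor of summarizing module outputs with random projections, and of graceful degradation under erasure, can be shown with a toy linear sketch. The construction below (a sum of per-module random projections, read back by the transposed projection) is a simplified stand-in, not the paper's scheme; all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8192          # sketch dimension
m = 16            # dimension of each module's output vector
num_modules = 5

# One fixed random projection per module; the toy sketch of a network run is
# simply the sum of each module's projected output.
R = [rng.normal(scale=1.0 / np.sqrt(d), size=(d, m)) for _ in range(num_modules)]
h = [rng.normal(size=m) for _ in range(num_modules)]
sketch = sum(Ri @ hi for Ri, hi in zip(R, h))

def recover(s, j, keep_fraction=1.0):
    """Approximately read module j's output back out of a (possibly partially erased) sketch."""
    return (R[j].T @ s) / keep_fraction

# Full sketch: recovery error is small because the random projections are nearly orthogonal.
err_full = np.linalg.norm(recover(sketch, 0) - h[0]) / np.linalg.norm(h[0])

# Erase half of the sketch coordinates at random: recovery degrades gracefully.
mask = rng.random(d) < 0.5
err_erased = np.linalg.norm(recover(sketch * mask, 0, keep_fraction=0.5) - h[0]) / np.linalg.norm(h[0])

print(f"relative recovery error, full sketch: {err_full:.3f}")
print(f"relative recovery error, 50% erased:  {err_erased:.3f}")
```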

Thu 13 June 11:30 - 11:35 PDT

Zero-Shot Knowledge Distillation in Deep Networks

Gaurav Kumar Nayak · Konda Reddy Mopuri · Vaisakh Shaj · Venkatesh Babu Radhakrishnan · Anirban Chakraborty

Knowledge distillation deals with the problem of training a smaller model from a high-capacity model so as to retain most of its performance. The source and target models are generally referred to as the Teacher and the Student, respectively. Existing approaches use either the training data or meta-data extracted from it in order to train the Student. However, accessing the dataset on which the Teacher has been trained may not always be feasible if the dataset is very large or if it poses privacy or safety concerns (e.g., biometric or medical data). Therefore, in this paper, we propose a novel data-free method to train the Student from the Teacher. Without even utilizing any meta-data, we extract Data Impressions from the parameters of the Teacher model and use these as surrogates for the original training data samples to transfer the Teacher's learning to the Student via knowledge distillation. Hence we dub our method "Zero-Shot Knowledge Distillation". We demonstrate that our framework achieves generalization performance competitive with that obtained using the actual training data samples, on multiple benchmark datasets.
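
For reference, the standard knowledge-distillation objective that the Student is trained with is the temperature-softened cross-entropy between Teacher and Student predictions; the paper's contribution is producing the transfer inputs (Data Impressions) from the Teacher alone, which this sketch does not attempt. A minimal NumPy version of the loss, with an arbitrary temperature and random logits standing in for a batch:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax along the class axis."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=5.0):
    """Softened cross-entropy between Teacher and Student predictions (the
    standard KD objective; the paper's novelty lies in what it is evaluated on)."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    return -np.mean(np.sum(p_teacher * log_p_student, axis=1)) * T * T

# Toy usage on random logits standing in for a batch of surrogate inputs.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 10))
student_logits = rng.normal(size=(8, 10))
print(f"distillation loss on a toy batch: {distillation_loss(student_logits, teacher_logits):.3f}")
```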

Thu 13 June 11:35 - 11:40 PDT

A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu · Yuanzhi Li · Zhao Song

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, the neural networks used in practice have been getting wider and deeper. On the theoretical side, a long line of work has focused on why we can train neural networks when there is only one hidden layer; the theory of multi-layer networks remains somewhat unsettled. In this work, we prove why simple algorithms such as stochastic gradient descent (SGD) can find GLOBAL MINIMA on the training objective of DNNs in POLYNOMIAL TIME. We only make two assumptions: the inputs do not degenerate and the network is over-parameterized. The latter means the number of hidden neurons is sufficiently large: polynomial in $L$, the number of DNN layers, and in $n$, the number of training samples. As concrete examples, on the training set and starting from randomly initialized weights, we show that SGD attains 100\% accuracy in classification tasks, or minimizes regression loss at a linear convergence rate $\varepsilon \propto e^{-\Omega(T)}$, with a number of iterations that scales only polynomially in $n$ and $L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth, possibly non-convex loss function. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNNs), and residual neural networks (ResNets).
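
The linear-rate behavior is easy to observe empirically on a tiny problem: with non-degenerate inputs and a sufficiently wide hidden layer, full-batch gradient descent drives the training loss down roughly geometrically. The sketch below trains only the hidden-layer weights of a one-hidden-layer ReLU network, a simplification of the paper's deep, multi-layer setting; the width, learning rate, and data sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression problem with non-degenerate (well-separated) unit-norm inputs.
n, d = 16, 8
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)

# Heavily over-parameterized one-hidden-layer ReLU network with NTK-style scaling.
width = 4096
W = rng.normal(size=(width, d))
a = rng.choice([-1.0, 1.0], size=width)

lr = 0.5
for step in range(0, 3001):
    pre = X @ W.T                                      # (n, width) pre-activations
    f = np.maximum(pre, 0.0) @ a / np.sqrt(width)      # network outputs
    err = f - y
    if step % 500 == 0:
        print(f"step {step:5d}  training MSE: {np.mean(err ** 2):.3e}")
    # Gradient descent on the hidden-layer weights only keeps the sketch short;
    # once the width is large, the loss typically decays roughly geometrically.
    grad_pre = (np.outer(err, a) * (pre > 0)) / (n * np.sqrt(width))
    W -= lr * (grad_pre.T @ X)
```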

Thu 13 June 11:40 - 12:00 PDT

A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks

Umut Simsekli · Levent Sagun · Mert Gurbuzbalaban

The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large-data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed $\alpha$-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a Lévy motion. Such SDEs can incur "jumps", which force the SDE to transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the $\alpha$-stable assumption, we conduct experiments on common deep learning scenarios and show that in all settings, the GN is highly non-Gaussian and admits heavy tails. We investigate the tail behavior across varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
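
A quick way to see whether noise looks Gaussian or heavy-tailed is to estimate a tail index from its largest magnitudes. The sketch below uses the textbook Hill estimator on synthetic samples rather than actual SGD gradient noise, and it is not the $\alpha$-stable estimator used in the paper; the sample size and the choice k=200 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def hill_estimator(samples, k=200):
    """Classic Hill estimator of the tail index from the k largest magnitudes.
    Smaller values mean heavier tails; Gaussian samples yield much larger estimates."""
    x = np.sort(np.abs(samples))[::-1]                 # magnitudes, largest first
    return k / np.sum(np.log(x[:k] / x[k]))

n = 100_000
gaussian_noise = rng.normal(size=n)
heavy_tailed_noise = rng.standard_t(df=1.5, size=n)    # power-law tail with index ~1.5

print(f"estimated tail index, Gaussian noise:     {hill_estimator(gaussian_noise):.2f}")
print(f"estimated tail index, heavy-tailed noise: {hill_estimator(heavy_tailed_noise):.2f}")
```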

Thu 13 June 12:00 - 12:05 PDT

Approximation and non-parametric estimation of ResNet-type convolutional neural networks

Kenta Oono · Taiji Suzuki

Convolutional neural networks (CNNs) have been shown to achieve optimal approximation and estimation error rates (in the minimax sense) in several function classes. However, previously known optimal CNNs are unrealistically wide and difficult to obtain via optimization due to sparse constraints in important function classes, including the Hölder class. We show that a ResNet-type CNN can attain the minimax-optimal error rates in these classes in more plausible situations -- it can be dense, and its width, channel size, and filter size are constant with respect to the sample size. The key idea is that we can replicate the learning ability of fully-connected neural networks (FNNs) with tailored CNNs, as long as the FNNs have block-sparse structures. Our theory is general in that we can automatically translate any approximation rate achieved by block-sparse FNNs into one achieved by CNNs. As an application, we derive approximation and estimation error rates for the aforementioned type of CNNs for the Barron and Hölder classes with the same strategy.
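
The emulation idea at the core of the construction, that convolutional layers can reproduce the computation of fully-connected ones, can be checked numerically in its crudest form: a fully-connected layer on a length-D signal equals a convolution whose filter spans the whole signal. The paper's actual construction uses constant filter sizes and block-sparse FNNs, which this sketch does not capture; the sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fully-connected layer on a length-D signal as a full-width "valid" convolution:
# each output channel is the inner product of the signal with one row of W.
D, out_channels = 12, 3
x = rng.normal(size=D)
W = rng.normal(size=(out_channels, D))

fc_output = W @ x
conv_output = np.array([np.convolve(x, W[c, ::-1], mode="valid")[0] for c in range(out_channels)])

print("max difference between FC layer and full-width convolution:",
      np.max(np.abs(fc_output - conv_output)))
```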

Thu 13 June 12:05 - 12:10 PDT

Global Convergence of Block Coordinate Descent in Deep Learning

Jinshan ZENG · Tim Tsz-Kit Lau · Shaobo Lin · Yuan Yao

Deep learning has attracted extensive attention due to its great empirical success. The efficiency of block coordinate descent (BCD) methods has recently been demonstrated for deep neural network (DNN) training. However, theoretical studies of their convergence properties are limited due to the highly nonconvex nature of DNN training. In this paper, we aim to provide a general methodology for establishing provable convergence guarantees for this class of methods. In particular, for most of the commonly used DNN training models involving both two- and three-splitting schemes, we establish global convergence to a critical point at a ${\cal O}(1/k)$ rate, where $k$ is the number of iterations. The results extend to general loss functions with Lipschitz-continuous gradients and to deep residual networks (ResNets). Our key development adds several new elements to the Kurdyka-Łojasiewicz inequality framework, enabling us to carry out a global convergence analysis of BCD in the general scenario of deep learning.
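
The alternating pattern behind block coordinate descent is easy to show on a smooth toy objective: update one block of variables with the other held fixed, using a block-wise Lipschitz step size, and the objective decreases monotonically. The paper's two- and three-splitting formulations introduce per-layer auxiliary variables and handle nonconvex DNN training, none of which appears in this sketch; all problem sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-block coordinate descent on the smooth toy objective
#   F(u, v) = 0.5 * ||A u + B v - c||^2,
# alternating a gradient step on each block while the other is held fixed.
m, p, q = 50, 10, 10
A = rng.normal(size=(m, p))
B = rng.normal(size=(m, q))
c = rng.normal(size=m)

u = np.zeros(p)
v = np.zeros(q)
lr_u = 1.0 / np.linalg.norm(A, 2) ** 2      # block-wise Lipschitz step sizes
lr_v = 1.0 / np.linalg.norm(B, 2) ** 2

def objective(u, v):
    return 0.5 * np.sum((A @ u + B @ v - c) ** 2)

for k in range(1, 201):
    r = A @ u + B @ v - c
    u -= lr_u * (A.T @ r)                   # update block u with v frozen
    r = A @ u + B @ v - c
    v -= lr_v * (B.T @ r)                   # update block v with u frozen
    if k in (1, 10, 50, 200):
        print(f"iteration {k:3d}  objective: {objective(u, v):.4f}")
```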

Thu 13 June 12:10 - 12:15 PDT

Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians

Vardan Papyan

We consider deep classifying neural networks. We expose a structure in the derivative of the logits with respect to the parameters of the model, which is used to explain the existence of outliers in the spectrum of the Hessian. Previous works decomposed the Hessian into two components, attributing the outliers to one of them, the so-called Covariance of gradients. We show this term is not a Covariance but a second moment matrix, i.e., it is influenced by the means of the gradients. These means possess an additive two-way structure that is the source of the outliers in the spectrum. This structure can be used to approximate the principal subspace of the Hessian using certain "averaging" operations, avoiding the need for high-dimensional eigenanalysis. We corroborate this claim across different datasets, architectures, and sample sizes.
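
The distinction between a covariance and a second-moment matrix can be illustrated with synthetic clustered vectors standing in for per-example gradients: when each class has a nonzero mean, the second-moment matrix acquires roughly one outlier eigenvalue per class, and removing the class means makes the outliers disappear. The dimensions, number of classes, and mean norm below are arbitrary choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-example gradients: each class contributes its own mean
# direction plus isotropic noise (the paper works with logit gradients; this
# only illustrates why a second-moment matrix, unlike a covariance, picks up
# outlier eigenvalues from nonzero class means).
d, num_classes, per_class = 100, 5, 400
means = rng.normal(size=(num_classes, d))
means *= 5.0 / np.linalg.norm(means, axis=1, keepdims=True)   # class means of norm 5

grads = np.concatenate([mu + rng.normal(size=(per_class, d)) for mu in means])
labels = np.repeat(np.arange(num_classes), per_class)

second_moment = grads.T @ grads / len(grads)
centered = grads - means[labels]                              # remove each class mean
covariance = centered.T @ centered / len(grads)

def top_eigs(M, k=8):
    """Largest k eigenvalues of a symmetric matrix, rounded for display."""
    return np.round(np.sort(np.linalg.eigvalsh(M))[::-1][:k], 2)

print("top eigenvalues of the second-moment matrix:", top_eigs(second_moment))
print("top eigenvalues after removing class means: ", top_eigs(covariance))
```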

Thu 13 June 12:15 - 12:20 PDT

On the Limitations of Representing Functions on Sets

Edward Wagstaff · Fabian Fuchs · Martin Engelcke · Ingmar Posner · Michael A Osborne

Recent work on the representation of functions on sets has considered the use of summation in a latent space to enforce permutation invariance. In particular, it has been conjectured that the dimension of this latent space may remain fixed as the cardinality of the sets under consideration increases. However, we demonstrate that the analysis leading to this conjecture requires mappings which are highly discontinuous and argue that this is only of limited practical use. Motivated by this observation, we prove that an implementation of this model via continuous mappings (as provided by e.g. neural networks or Gaussian processes) actually imposes a constraint on the dimensionality of the latent space. Practical universal function representation for set inputs can only be achieved with a latent dimension at least the size of the maximum number of input elements.
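
The model under discussion is the sum-pooling construction f(X) = rho(sum_i phi(x_i)), which is permutation-invariant by construction; the paper's result concerns how large the latent (pooled) dimension must be for continuous phi and rho to represent arbitrary set functions. Below is a minimal sketch of the construction with randomly initialized maps and an arbitrary latent dimension of 32, purely to show the invariance, not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum-pooling set function f(X) = rho( sum_i phi(x_i) ) for scalar set elements.
latent_dim = 32
w_phi = rng.normal(size=latent_dim)
b_phi = rng.normal(size=latent_dim)
w_rho = rng.normal(size=latent_dim)

def phi(x):
    """Per-element embedding into the latent space (a single ReLU layer)."""
    return np.maximum(w_phi * x + b_phi, 0.0)

def set_function(xs):
    pooled = sum(phi(x) for x in xs)          # summation makes the output order-independent
    return float(w_rho @ pooled)

xs = rng.normal(size=10)
print(f"original order: {set_function(xs):.6f}")
print(f"shuffled order: {set_function(rng.permutation(xs)):.6f}")   # same value up to roundoff
```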