Session
Deep Learning Theory
On Learning Invariant Representations for Domain Adaptation
Han Zhao · Remi Tachet des Combes · Kun Zhang · Geoff Gordon
Due to the ability of deep neural nets to learn rich representations, recent advances in unsupervised domain adaptation have focused on learning domain-invariant features that achieve a small error on the source domain. The hope is that the learnt representation, together with the hypothesis learnt from the source domain, can generalize to the target domain. In this paper, we first construct a simple counterexample showing that, contrary to common belief, the above conditions are not sufficient to guarantee successful domain adaptation. In particular, the counterexample exhibits \emph{conditional shift}: the class-conditional distributions of input features change between source and target domains. To give a sufficient condition for domain adaptation, we propose a natural and interpretable generalization upper bound that explicitly takes into account the aforementioned shift. Moreover, we shed new light on the problem by proving an information-theoretic lower bound on the joint error of \emph{any} domain adaptation method that attempts to learn invariant representations. Our result characterizes a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Finally, we conduct experiments on real-world datasets that corroborate our theoretical findings. We believe these insights are helpful in guiding the future design of domain adaptation and representation learning algorithms.
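As a rough illustration of the mechanism behind such counterexamples (a hypothetical construction in the spirit of the paper, not its exact one): a representation can make the feature marginals of the two domains identical and support a zero-error source hypothesis, yet flipped class-conditionals make the target error maximal.

```python
# Toy illustration (not the paper's exact construction) of conditional shift:
# a perfectly domain-invariant representation with zero source error can still
# incur maximal target error when class-conditional distributions change.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Source: X ~ U[0, 1], label 1 iff X > 0.5
xs = rng.uniform(0.0, 1.0, n)
ys = (xs > 0.5).astype(int)

# Target: X ~ U[1, 2], label 1 iff X < 1.5 (class-conditionals are "flipped")
xt = rng.uniform(1.0, 2.0, n)
yt = (xt < 1.5).astype(int)

g = lambda x: x - np.floor(x)        # representation: both domains map to U[0, 1]
h = lambda z: (z > 0.5).astype(int)  # hypothesis learnt on the source

err_s = np.mean(h(g(xs)) != ys)      # ~0.0
err_t = np.mean(h(g(xt)) != yt)      # ~1.0
print(f"source error {err_s:.3f}, target error {err_t:.3f}")
```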
Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models
Mor Shpigel Nacson · Suriya Gunasekar · Jason Lee · Nati Srebro · Daniel Soudry
With an eye toward understanding complexity control in deep learning, we study how infinitesimal regularization or gradient descent optimization leads to margin-maximizing solutions in both homogeneous and non-homogeneous models, extending previous work that focused on infinitesimal regularization only in homogeneous models. To this end, we study the limit of loss minimization with a diverging norm constraint (the ``constrained path''), relate it to the limit of a ``margin path'', and characterize the resulting solution. For non-homogeneous models we show that this solution is biased toward the deepest part of the model, discarding the shallowest parts if they are unnecessary. For homogeneous models, we show convergence to a ``lexicographic max-margin solution'', and provide conditions under which max-margin solutions are also attained as the limit of unconstrained gradient descent.
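For intuition, the simplest homogeneous case is a linear model on separable data, where unconstrained gradient descent on a logistic-type loss is known to drive the iterates' direction toward a max-margin separator. A minimal sketch (illustrative only, not the paper's construction):

```python
# Minimal sketch (linear, i.e. homogeneous depth-one, model): on separable data,
# unconstrained gradient descent on the logistic loss lets ||w|| grow without
# bound while the *direction* w/||w|| approaches a max-margin separator.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 0.3, (50, 2)),
               rng.normal([-2, -2], 0.3, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

w = np.zeros(2)
lr = 0.5
for t in range(100_000):
    margins = y * (X @ w)
    grad = -(y[:, None] * X * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(0)
    w -= lr * grad

direction = w / np.linalg.norm(w)
min_margin = np.min(y * (X @ direction))   # margin attained by the normalized iterate
print(direction, min_margin)
```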
Adversarial Generation of Time-Frequency Features with application in audio synthesis
Andrés Marafioti · Nathanaël Perraudin · Nicki Holighaus · Piotr Majdak
Time-frequency (TF) representations provide powerful and intuitive features for the analysis of time series such as audio. Still, generative modeling of audio in the TF domain is a subtle matter. Consequently, neural audio synthesis widely relies on directly modeling the waveform, and previous attempts at unconditionally synthesizing audio from neurally generated TF features still struggle to produce audio of satisfying quality. In this contribution, focusing on the short-time Fourier transform, we discuss the challenges that arise in audio synthesis based on generated TF features and how to overcome them. We demonstrate the potential of deliberate generative TF modeling by training a generative adversarial network (GAN) on short-time Fourier features. We show that our TF-based network is able to outperform the state-of-the-art waveform-generating GAN, despite the two networks having similar architectures.
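A minimal sketch of the TF round-trip that any STFT-based synthesizer relies on, using SciPy's stft/istft on a toy tone (illustrative only; a generator would output TF features such as log-magnitudes plus a phase representation rather than computing them from a signal, and those features must stay consistent with an actual STFT for the inverse transform to sound good):

```python
# Sketch of the STFT round-trip underlying TF-based synthesis: audio is mapped to
# complex short-time Fourier coefficients and back. Generated TF features that
# stray far from the image of a real STFT degrade under the inverse transform.
# (Toy signal here, not the paper's setup.)
import numpy as np
from scipy.signal import stft, istft

fs = 16_000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)            # 1 s of a 440 Hz tone

f, frames, Zxx = stft(x, fs=fs, nperseg=512)     # complex TF features
_, x_rec = istft(Zxx, fs=fs, nperseg=512)        # inverse STFT

print(Zxx.shape, np.max(np.abs(x - x_rec[: len(x)])))  # near-perfect reconstruction
```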
On the Universality of Invariant Networks
Haggai Maron · Ethan Fetaya · Nimrod Segol · Yaron Lipman
Constraining linear layers in neural networks to respect symmetry transformations from a group $G$ is a common design principle for invariant networks that has found many applications in machine learning. In this paper, we consider a fundamental question that has received very little attention to date: Can these networks approximate any (continuous) invariant function? We tackle the rather general case where $G\leq S_n$ (an arbitrary subgroup of the symmetric group) acts on $\mathbb{R}^n$ by permuting coordinates. This setting includes several recent popular invariant networks. We present two main results: First, $G$-invariant networks are universal if high-order tensors are allowed. Second, there are groups $G$ for which higher-order tensors are unavoidable for obtaining universality. $G$-invariant networks consisting of only first-order tensors are of special interest due to their practical value. We conclude the paper by proving a necessary condition for the universality of $G$-invariant networks that incorporate only first-order tensors. Lastly, we propose a conjecture stating that this condition is also sufficient.
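For concreteness, the simplest first-order case is $G = S_n$, where the equivariant linear maps on $\mathbb{R}^n$ are spanned by the identity and averaging. A minimal sketch of such a first-order invariant network (Deep Sets style, only a special case of the networks studied in the paper):

```python
# Sketch of a first-order G-invariant network for the full symmetric group
# G = S_n: equivariant linear layers x -> a*x + b*mean(x), followed by invariant
# sum pooling. The paper studies arbitrary subgroups G <= S_n and higher-order
# tensor layers; this is only the simplest first-order instance.
import torch
import torch.nn as nn

class EquivariantLinear(nn.Module):
    """S_n-equivariant map applied channel-wise: a(x) + b(mean(x))."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.a = nn.Linear(c_in, c_out, bias=False)
        self.b = nn.Linear(c_in, c_out, bias=True)

    def forward(self, x):                        # x: (batch, n, c_in)
        return self.a(x) + self.b(x.mean(dim=1, keepdim=True))

net = nn.Sequential(EquivariantLinear(1, 16), nn.ReLU(),
                    EquivariantLinear(16, 16), nn.ReLU())
head = nn.Linear(16, 1)

x = torch.randn(4, 10, 1)
perm = torch.randperm(10)
out1 = head(net(x).sum(dim=1))                   # invariant pooling over the set
out2 = head(net(x[:, perm, :]).sum(dim=1))
print(torch.allclose(out1, out2, atol=1e-5))     # True: output is permutation-invariant
```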
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Sanjeev Arora · Simon Du · Wei Hu · Zhiyuan Li · Ruosong Wang
Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works:
(i) An explanation, based on a tighter characterization of training speed than in recent papers, for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17].
(ii) Generalization bound independent of network size, using a data-dependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require the sample complexity to increase (slowly) with the network size, while our sample complexity is completely independent of it.
(iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets trained via gradient descent.
The key idea is to track dynamics of training and generalization via properties of a related kernel.
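A hedged sketch of the kind of data-dependent quantity involved, assuming the infinite-width Gram matrix of a two-layer ReLU net on unit-norm inputs takes the standard closed form $H_{ij} = x_i^\top x_j\,(\pi - \arccos(x_i^\top x_j))/(2\pi)$ and the measure is roughly $\sqrt{2\, y^\top H^{-1} y / n}$ (constants and normalization may differ from the paper's exact statement):

```python
# Sketch of a data-dependent complexity measure built from the infinite-width
# Gram matrix of a two-layer ReLU net (closed form assumed as in the lead-in;
# synthetic data here, whereas the paper evaluates on MNIST/CIFAR).
import numpy as np

def complexity(X, y):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs
    G = np.clip(X @ X.T, -1.0, 1.0)
    H = G * (np.pi - np.arccos(G)) / (2 * np.pi)           # infinite-width Gram matrix
    n = len(y)
    return np.sqrt(2.0 * y @ np.linalg.solve(H + 1e-8 * np.eye(n), y) / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y_true = np.sign(X[:, 0])                   # labels with structure
y_rand = rng.choice([-1.0, 1.0], size=200)  # random labels
print(complexity(X, y_true), complexity(X, y_rand))  # structured labels typically score lower
```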
Gauge Equivariant Convolutional Networks and the Icosahedral CNN
Taco Cohen · Maurice Weiler · Berkay Kicanaoglu · Max Welling
The idea of equivariance to symmetry transformations provides one of the first theoretically grounded principles for neural network architecture design. Equivariant networks have shown excellent performance and data efficiency on vision and medical imaging problems that exhibit symmetries. In this paper we show how the theory can be extended from global symmetries to local gauge transformations, which makes it possible in principle to develop equivariant networks on general manifolds.
We implement gauge equivariant CNNs for signals defined on the icosahedron, which provides a reasonable approximation of spherical signals. By choosing to work with this very regular manifold, we are able to implement the gauge equivariant convolution using a single conv2d call, making it a highly scalable and practical alternative to Spherical CNNs.
We evaluate the effectiveness of Icosahedral CNNs on a number of different problems, and show that they yield excellent accuracy and computational efficiency.
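A heavily simplified sketch of the data layout that makes a single conv2d call possible: the icosahedral signal is stored as an atlas of five rectangular charts stacked along one spatial axis. The gauge-specific parts (padding each chart with values from its neighbours and cyclically shifting orientation channels) are omitted, so this shows only the skeleton of the computation, not the equivariant operation itself.

```python
# Skeleton of the atlas layout: 5 rectangular charts covering the icosahedron,
# convolved with one conv2d call. Gauge-aware padding between charts and the
# orientation-channel cyclic shift are deliberately omitted here.
import torch
import torch.nn.functional as F

B, C_in, C_out, H, W = 2, 8, 16, 16, 16
atlas = torch.randn(B, C_in, 5, H, W)            # 5 charts covering the icosahedron

weight = torch.randn(C_out, C_in, 3, 3)
x = atlas.reshape(B, C_in, 5 * H, W)             # stack charts along the height axis
y = F.conv2d(x, weight, padding=1)               # one conv2d over the whole atlas
y = y.reshape(B, C_out, 5, H, W)
print(y.shape)
```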
Feature-Critic Networks for Heterogeneous Domain Generalization
Yiying Li · Yongxin Yang · Wei Zhou · Timothy Hospedales
Domain shift is the well-known problem whereby model performance degrades when a model is deployed to a new target domain whose statistics differ from those of the training data. Domain adaptation techniques alleviate this, but need some instances from the target domain to drive adaptation. Domain generalization is the recently topical problem of learning a model that generalizes to unseen domains out of the box, without accessing any target data. Various domain generalization approaches aim to train a domain-invariant feature extractor, typically by adding some manually designed losses. In this work, we propose a “learning to learn” approach, where the auxiliary loss that helps generalization is itself learned. This approach is conceptually simple and flexible, and leads to considerable improvement in robustness to domain shift. Beyond conventional domain generalization, we consider a more challenging setting of “heterogeneous” domain generalization, where the unseen domains do not share a label space with the seen ones, and the goal is to train a feature extractor that is useful off-the-shelf for novel data and novel categories. Experimental evaluation demonstrates that our method outperforms state-of-the-art solutions in both settings.
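A toy sketch of the meta-learning loop implied by “the auxiliary loss is itself learned” (names, dimensions, and the single inner step are illustrative assumptions, not the paper's exact algorithm): the feature extractor takes a step on the task loss plus a learned critic loss, and the critic receives gradient from how much that step helped on a held-out domain.

```python
# Sketch of a learned auxiliary loss: a critic on the features contributes to the
# inner update of the feature extractor, and the critic is trained (via a
# second-order gradient) so that this update improves a meta-validation domain.
import torch
import torch.nn.functional as F

d, k, h = 32, 4, 16
W_feat = torch.randn(d, h, requires_grad=True)          # feature extractor
W_cls = torch.randn(h, k, requires_grad=True)           # classifier
W_crit = torch.randn(h, 1, requires_grad=True)          # learned auxiliary loss (critic)

x_tr, y_tr = torch.randn(64, d), torch.randint(0, k, (64,))   # meta-train domain
x_va, y_va = torch.randn(64, d), torch.randint(0, k, (64,))   # meta-validation domain

def val_loss(w_feat):
    return F.cross_entropy(torch.relu(x_va @ w_feat) @ W_cls, y_va)

feats = torch.relu(x_tr @ W_feat)
inner = F.cross_entropy(feats @ W_cls, y_tr) + F.softplus(feats @ W_crit).mean()
g, = torch.autograd.grad(inner, W_feat, create_graph=True)
W_feat_new = W_feat - 0.1 * g                            # inner update with aux loss

meta = val_loss(W_feat_new) - val_loss(W_feat).detach()  # did the aux loss help?
meta.backward()                                          # gradient flows into W_crit
print(W_crit.grad.norm())
```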
Learning to Convolve: A Generalized Weight-Tying Approach
Nichita Diaconu · Daniel E Worrall
Recent work (Cohen & Welling, 2016) has shown that generalizations of convolutions, based on group theory, provide powerful inductive biases for learning. In these generalizations, filters are not only translated but can also be rotated, flipped, etc. However, coming up with exact models of how to rotate a 3x3 filter on a square pixel-grid is difficult.
In this paper, we learn how to transform filters for use in the group convolution, focusing on roto-translation. For this, we learn a filter basis and all rotated versions of that basis. Filters are then encoded by a set of rotation-invariant coefficients; to rotate a filter, we simply switch the basis. We demonstrate that we can produce feature maps with low sensitivity to input rotations, while achieving high performance on MNIST and CIFAR-10.
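A small sketch of the basis-switching idea (simplified to 90-degree rotations of a random basis; the paper learns the basis and its finer rotated versions):

```python
# Sketch of basis switching: a filter is stored as coefficients over a filter
# basis, and "rotating" the filter means re-expanding the same coefficients in a
# rotated copy of that basis. 90-degree rotations stand in for learned rotations.
import torch

n_basis, C_out, C_in = 6, 8, 4
basis = torch.randn(n_basis, 3, 3)                           # filter basis (learned in the paper)
rotated = torch.stack([torch.rot90(basis, r, dims=(1, 2)) for r in range(4)])  # (4, n_basis, 3, 3)

coeffs = torch.randn(C_out, C_in, n_basis)                   # rotation-invariant filter code

# Expand filters for every rotation r: filters[r] has shape (C_out, C_in, 3, 3)
filters = torch.einsum('oib,rbhw->roihw', coeffs, rotated)
print(filters.shape)   # (4, C_out, C_in, 3, 3): same coefficients, switched basis
# These rotated copies would then be stacked into the weights of a
# roto-translation group convolution.
```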
On Dropout and Nuclear Norm Regularization
Poorya Mianjy · Raman Arora
We give a formal and complete characterization of the explicit regularizer induced by dropout in deep linear networks with the squared loss. We show that (a) the explicit regularizer is composed of an $\ell_2$-path regularizer and other terms that are also re-scaling invariant, (b) the convex envelope of the induced regularizer is the squared nuclear norm of the network map, and (c) for a sufficiently large dropout rate, we characterize the global optima of the dropout objective. We validate our theoretical findings with empirical results.
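As a sanity check of the flavour of this result, in the single-hidden-layer case one can verify numerically that the expected dropout objective equals the dropout-free squared loss plus an explicit rescaling-invariant penalty (a sketch under that restricted setting; the paper's general deep-network characterization is not reproduced here):

```python
# Numeric check, single-hidden-layer linear case: for x -> V diag(b/p) U x with
# Bernoulli(p) dropout b on the hidden layer and squared loss, the *expected*
# objective equals the dropout-free loss plus (1-p)/p * sum_j (u_j^T x)^2 ||v_j||^2.
# (Monte-Carlo vs closed form; only a special case of the paper's setting.)
import numpy as np

rng = np.random.default_rng(0)
d, h, k, p = 5, 7, 3, 0.6
U, V = rng.normal(size=(h, d)), rng.normal(size=(k, h))
x, y = rng.normal(size=d), rng.normal(size=k)

# Monte-Carlo estimate of E_b || y - V diag(b/p) U x ||^2
b = rng.binomial(1, p, size=(200_000, h)) / p
outs = (b * (U @ x)) @ V.T                     # (200000, k)
mc = np.mean(np.sum((y - outs) ** 2, axis=1))

closed = np.sum((y - V @ U @ x) ** 2) \
       + (1 - p) / p * np.sum((U @ x) ** 2 * np.sum(V ** 2, axis=0))
print(mc, closed)   # the two values agree up to Monte-Carlo noise
```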
Gradient Descent Finds Global Minima of Deep Neural Networks
Simon Du · Jason Lee · Haochuan Li · Liwei Wang · Xiyu Zhai
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
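A toy empirical illustration of the claim, using a wide two-layer ReLU net and plain full-batch gradient descent rather than the paper's deep ResNet setting (a sketch, not the proof):

```python
# Toy illustration: with enough over-parameterization, plain full-batch gradient
# descent drives the training loss to (near) zero even on random data.
# (Two-layer ReLU net here; the paper analyzes deep residual architectures.)
import torch

torch.manual_seed(0)
n, d, m = 50, 10, 4096                        # 50 points, width 4096 >> n
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True)           # unit-norm inputs
y = torch.randn(n, 1)

W1 = torch.randn(d, m, requires_grad=True)    # standard Gaussian init
a = torch.randn(m, 1, requires_grad=True)

lr = 0.5
for step in range(10_000):
    loss = ((torch.relu(X @ W1) @ a / m ** 0.5 - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        W1 -= lr * W1.grad
        a -= lr * a.grad
        W1.grad.zero_()
        a.grad.zero_()
print(loss.item())                             # near-zero training loss
```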