Session
Deep Learning
Understanding and correcting pathologies in the training of learned optimizers
Luke Metz · Niru Maheswaranathan · Jeremy Nixon · Daniel Freeman · Jascha Sohl-Dickstein
Deep learning has shown that learned functions can dramatically outperform hand-designed functions on perceptual tasks. This suggests that learned optimizers may similarly outperform current hand-designed optimizers, especially for specific problems. However, learned optimizers are notoriously difficult to train and have yet to demonstrate wall-clock speedups over hand-designed optimizers, and thus are rarely used in practice. Typically, learned optimizers are trained by truncated backpropagation through an unrolled optimization process. The resulting gradients are either strongly biased (for short truncations) or have exploding norm (for long truncations). In this work we propose a training scheme which overcomes both of these difficulties by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance. This allows us to train neural networks to perform optimization of a specific task faster than tuned first-order methods. Moreover, by training the optimizer against validation loss (as opposed to training loss), we are able to learn optimizers that train networks to generalize better than first-order methods. We demonstrate these results on problems where our learned optimizer trains convolutional networks faster in wall-clock time compared to tuned first-order methods and with an improvement in test loss.
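The abstract does not spell out how the two unbiased estimators are combined; one natural realization of "dynamically weighting two unbiased gradient estimators" is inverse-variance weighting of, say, a reparameterization-style and an evolution-strategies-style gradient. The sketch below illustrates that weighting only; the function name and the smoothing constant are assumptions, not the paper's exact scheme.

```python
import numpy as np

def inverse_variance_combine(g_rp, g_es, var_rp, var_es, eps=1e-8):
    """Combine two unbiased gradient estimates by inverse-variance weighting.

    g_rp, g_es   : per-parameter gradient estimates, e.g. from backprop through
                   the unroll and from an evolution-strategies estimator
    var_rp, var_es : empirical variances of each estimator, same shape
    """
    w_rp = 1.0 / (var_rp + eps)
    w_es = 1.0 / (var_es + eps)
    return (w_rp * g_rp + w_es * g_es) / (w_rp + w_es)
```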
Dropout is a popular technique for training large-scale deep neural networks that alleviates the overfitting problem. To uncover the underlying reasons for its gains, numerous works have tried to explain it from different perspectives. In this paper, unlike existing works, we explore it from a new perspective to provide new insight into this line of research. In detail, we disentangle the forward and backward pass of dropout. We find that these two passes need different levels of noise to improve the generalization performance of deep neural networks. Based on this observation, we propose augmented dropout, which employs different dropping strategies in the forward and backward passes. Experimental results verify the effectiveness of our proposed method.
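The abstract does not detail the dropping strategies. As a rough illustration of decoupling forward and backward noise, the sketch below applies one dropout mask to activations on the forward pass and an independent mask, with a different rate, to the incoming gradient on the backward pass; the class name, rates, and masking rule are hypothetical, not the paper's method.

```python
import torch

class TwoRateDropout(torch.autograd.Function):
    """Illustrative dropout with decoupled forward/backward noise:
    rate p_fwd drops activations going forward, rate p_bwd independently
    drops gradient entries going backward."""

    @staticmethod
    def forward(ctx, x, p_fwd, p_bwd):
        mask = (torch.rand_like(x) > p_fwd).float() / (1.0 - p_fwd)
        ctx.p_bwd = p_bwd
        return x * mask

    @staticmethod
    def backward(ctx, grad_out):
        mask = (torch.rand_like(grad_out) > ctx.p_bwd).float() / (1.0 - ctx.p_bwd)
        # No gradients are needed for the two rate arguments.
        return grad_out * mask, None, None

# usage: y = TwoRateDropout.apply(x, 0.5, 0.3)
```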
We propose a new architecture of the capsule network called the ladder capsule network, which has an alternative building block to the dynamic routing algorithm in the capsule network (Sabour et al., 2017). Motivated by the need to use only important capsules during training for robust performance, we first introduce a new layer called the pruning layer, which removes irrelevant capsules. Based on the selected capsules, we construct higher-level capsule outputs. Subsequently, to capture the part-whole spatial relationships, we introduce another new layer called the ladder layer, whose outputs are lower-level capsule outputs regressed from the higher-level capsules. Unlike the capsule network, which adopts routing-by-agreement, the ladder capsule network uses backpropagation from a loss function to reconstruct the lower-level capsule outputs from higher-level capsules; thus, the ladder layer implements the reverse-directional inference of the agreement/disagreement mechanism of the capsule network. The experiments on MNIST demonstrate that the ladder capsule network learns an equivariant representation and improves the capability to extrapolate or generalize to pose variations.
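The abstract does not state the criterion the pruning layer uses to decide which capsules are irrelevant. A common proxy for capsule importance is activation norm, so the sketch below keeps the k capsules with the largest norm; both the criterion and the function name are assumptions made only for illustration.

```python
import numpy as np

def prune_capsules(capsules, k):
    """Hypothetical pruning layer: keep the k capsules with largest norm.

    capsules : array of shape (num_capsules, capsule_dim)
    Returns the selected capsules and their indices.
    """
    norms = np.linalg.norm(capsules, axis=1)
    idx = np.argsort(-norms)[:k]
    return capsules[idx], idx
```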
Unreproducible Research is Reproducible
Xavier Bouthillier · César Laurent · Pascal Vincent
The apparent contradiction of the title is a wordplay on the different meanings attributed to the word reproducible across different scientific fields. What we imply is that unreproducible findings can be built upon reproducible methods. Without denying the importance of facilitating the reproduction of methods, we deem it important to reassert that reproduction of findings is a fundamental step of scientific inquiry. We argue that the commendable quest towards easy deterministic reproducibility of methods and numerical results should not make us forget the even more important necessity of ensuring the reproducibility of empirical conclusions and findings by properly accounting for essential sources of variation. We provide experiments to exemplify the limitations of current methods for evaluating models in the field of deep learning, showing that even if the results could be reproduced, a slightly different experiment would not support the findings. This work is an attempt to promote the use of more rigorous and diversified methodologies. It is not an attempt to impose a new methodology, nor is it a critique of the nature of exploratory research. We hope to help clarify the distinction between exploratory and empirical research in the field of deep learning, and we believe more energy should be devoted to proper empirical research in our community.
Geometric Scattering for Graph Data Analysis
Feng Gao · Guy Wolf · Matthew Hirn
We explore the generalization of scattering transforms from traditional (e.g., image or audio) signals to graph data, analogous to the generalization of ConvNets in geometric deep learning, and the utility of the extracted graph features in graph data analysis. In particular, we focus on the capacity of these features to retain informative variability and relations in the data (e.g., between individual graphs, or in aggregate), while relating our construction to previous theoretical results that establish the stability of similar transforms to families of graph deformations. We demonstrate the application of our geometric scattering features in graph classification of social network data, and in data exploration of biochemistry data.
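As a concrete, though simplified, illustration of scattering-style graph features, the sketch below forms lazy random-walk wavelets Psi_j = P^(2^(j-1)) - P^(2^j) and summarizes the absolute wavelet responses of a node signal by their first few moments. Only first-order coefficients are computed, and the operator normalization, number of scales, and choice of moments are assumptions rather than the paper's exact construction.

```python
import numpy as np

def graph_scattering_moments(adj, signal, num_scales=3, num_moments=4):
    """Rough sketch of first-order geometric scattering features.

    adj    : (n, n) adjacency matrix of an undirected graph
    signal : (n,) node signal
    Returns graph-level features: q-th moments of |Psi_j signal| per scale j.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = 0.5 * (np.eye(n) + adj / np.maximum(deg, 1e-12))  # lazy random walk
    feats = []
    for j in range(1, num_scales + 1):
        low = np.linalg.matrix_power(P, 2 ** (j - 1))
        high = np.linalg.matrix_power(P, 2 ** j)
        coeff = np.abs((low - high) @ signal)              # wavelet response
        feats.extend((coeff ** q).sum() for q in range(1, num_moments + 1))
    return np.array(feats)
```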
Robust Inference via Generative Classifiers for Handling Noisy Labels
Kimin Lee · Sukmin Yun · Kibok Lee · Honglak Lee · Bo Li · Jinwoo Shin
Large-scale datasets may contain significant proportions of noisy (incorrect) class labels, and it is well-known that modern deep neural networks (DNNs) generalize poorly from such noisy training datasets. To mitigate the issue, we propose a novel inference method, termed Robust Generative classifier (RoG), applicable to any discriminative (e.g., softmax) neural classifier pre-trained on noisy datasets. In particular, we induce a generative classifier on top of the hidden feature spaces of the pre-trained DNNs, to obtain a more robust decision boundary. By estimating the parameters of the generative classifier using the minimum covariance determinant estimator, we significantly improve the classification accuracy with neither re-training of the deep model nor changes to its architecture. Under the assumption of a Gaussian distribution for the features, we prove that RoG generalizes better than baselines under noisy labels. Finally, we propose an ensemble version of RoG that improves performance by leveraging the layer-wise characteristics of DNNs. Our extensive experimental results demonstrate the superiority of RoG across different learning models optimized by several training techniques to handle diverse scenarios of noisy labels.
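A minimal sketch of the idea, fitting robust class-conditional Gaussians on pre-trained features with the minimum covariance determinant (MCD) estimator and classifying by Mahalanobis distance, is shown below. The function names, the shared-covariance choice, and the use of scikit-learn's MinCovDet are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def fit_robust_generative_classifier(features, labels):
    """Fit robust class means and a shared covariance on hidden features.

    features : (n, d) activations from a pre-trained network
    labels   : (n,) possibly noisy class labels
    """
    classes = np.unique(labels)
    means, covs = {}, []
    for c in classes:
        mcd = MinCovDet().fit(features[labels == c])
        means[c] = mcd.location_
        covs.append(mcd.covariance_)
    shared_cov = np.mean(covs, axis=0)
    return means, shared_cov

def predict(x, means, shared_cov):
    """Assign the class whose robust mean is closest in Mahalanobis distance."""
    prec = np.linalg.pinv(shared_cov)
    dists = {c: (x - m) @ prec @ (x - m) for c, m in means.items()}
    return min(dists, key=dists.get)
```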
LIT: Learned Intermediate Representation Training for Model Compression
Animesh Koratana · Daniel Kang · Peter Bailis · Matei Zaharia
Researchers have proposed a range of model compression techniques to reduce the computational and memory footprint of deep neural networks (DNNs). In this work, we introduce Learned Intermediate representation Training (LIT), a novel model compression technique that outperforms a range of recent model compression techniques by leveraging the highly repetitive structure of modern DNNs (e.g., ResNet). LIT uses a teacher DNN to train a student DNN of reduced depth by leveraging two key ideas: 1) LIT directly compares intermediate representations of the teacher and student model and 2) LIT uses the intermediate representation from the teacher model's previous block as input to the current student block during training, improving stability of intermediate representations in the student network. We show that LIT can substantially reduce network size without loss in accuracy on a range of DNN architectures and datasets. For example, LIT can compress ResNet on CIFAR10 by 3.7× compared to 1.5× and 2.6× by network slimming and FitNets respectively (by weight). Furthermore, LIT can compress, by depth, ResNeXt 5.5× on CIFAR10 (image classification), VDCNN by 1.7× on Amazon Reviews (sentiment analysis), and StarGAN by 1.8× on CelebA (style transfer, i.e., GANs).
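The sketch below illustrates the per-block training signal described in the abstract: the student block receives the teacher's activation entering the corresponding block group and is penalized for deviating from the teacher's activation leaving that group. The function name and the plain MSE penalty are simplifying assumptions on my part.

```python
import torch
import torch.nn.functional as F

def lit_block_loss(student_block, teacher_in, teacher_out):
    """Intermediate-representation loss for one student block.

    teacher_in  : teacher activation entering the block group
                  (also used as the student's input, as the abstract describes)
    teacher_out : teacher activation leaving the block group
    """
    student_out = student_block(teacher_in)
    return F.mse_loss(student_out, teacher_out)
```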
Analyzing and Improving Representations with the Soft Nearest Neighbor Loss
Nicholas Frosst · Nicolas Papernot · Geoffrey Hinton
We introduce the Soft Nearest Neighbor Loss, which measures the entanglement of class manifolds in representation space: i.e., how close pairs of points from the same class are relative to pairs of points from different classes. We demonstrate several use cases of the loss. As an analytical tool, it provides insights into the evolution of similarity structures during learning. Surprisingly, we find that maximizing the entanglement of representations of different classes in the hidden layers is beneficial for discrimination, possibly because it encourages representations to identify class-independent similarity structures. Maximizing the soft nearest neighbor loss in the hidden layers leads not only to improved generalization but also to better-calibrated estimates of uncertainty on outlier data. Data that is not from the training distribution can be recognized by observing that, in the hidden layers, it has fewer than the normal number of neighbors from the predicted class.
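The abstract does not write out the loss; the sketch below follows the standard soft nearest neighbor formulation, where each point's "neighbor" is drawn with probability proportional to exp(-||x_i - x_j||^2 / T) and the loss is the mean negative log-probability that this neighbor shares the point's class. The temperature value and the small constant inside the log are illustrative.

```python
import numpy as np

def soft_nearest_neighbor_loss(x, y, temperature=1.0):
    """Soft nearest neighbor loss for a batch of representations.

    x : (b, d) representations, y : (b,) integer class labels.
    """
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # (b, b)
    sims = np.exp(-sq_dists / temperature)
    np.fill_diagonal(sims, 0.0)                                 # exclude self
    same_class = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(same_class, 0.0)
    num = (sims * same_class).sum(axis=1)   # mass on same-class neighbors
    den = sims.sum(axis=1)                  # mass on all neighbors
    return -np.mean(np.log(num / den + 1e-12))
```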
What is the Effect of Importance Weighting in Deep Learning?
Jonathon Byrd · Zachary Lipton
Importance-weighted risk minimization is a key ingredient in many machine learning algorithms for causal inference, domain adaptation, class imbalance, and off-policy reinforcement learning. While the effect of importance weighting is well-characterized for low-capacity misspecified models, little is known about how it impacts over-parameterized deep neural networks. This work is inspired by recent theoretical results showing that on (linearly) separable data, deep linear networks optimized by SGD learn weight-agnostic solutions, prompting us to ask: for realistic deep networks, for which many practical datasets are separable, what is the effect of importance weighting? We present the surprising finding that while importance weighting impacts models early in training, its effect diminishes over successive epochs. Moreover, while L2 regularization and batch normalization (but not dropout) restore some of the impact of importance weighting, they express the effect via (seemingly) the wrong abstraction: why should practitioners tweak the L2 regularization, and by how much, to produce the correct weighting effect? Our experiments confirm these findings across a range of architectures and datasets.
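For reference, the objective under study is simple to state. Below is a minimal sketch of an importance-weighted classification loss; the per-class weights (e.g., inverse class frequency for class imbalance) and the function name are hypothetical illustrations of the general scheme.

```python
import torch
import torch.nn.functional as F

def importance_weighted_loss(logits, targets, class_weights):
    """Importance-weighted empirical risk for classification.

    logits        : (b, num_classes) model outputs
    targets       : (b,) integer labels
    class_weights : (num_classes,) per-class importance weights
    """
    per_example = F.cross_entropy(logits, targets, reduction="none")
    weights = class_weights[targets]        # weight of each example's class
    return (weights * per_example).mean()
```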
Similarity of Neural Network Representations Revisited
Simon Kornblith · Mohammad Norouzi · Honglak Lee · Geoffrey Hinton
Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network representations based on canonical correlation analysis (CCA). We show that CCA belongs to a family of statistics for measuring multivariate similarity, but that neither CCA nor any other statistic that is invariant to invertible linear transformation can measure meaningful similarities between representations of higher dimension than the number of available data points. We introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. We show that this similarity index is equivalent to centered kernel alignment (CKA) and is also closely connected to CCA. Unlike other methods, CKA can reliably identify correspondences between representations of layers in networks trained from different initializations.
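The linear form of centered kernel alignment has a compact closed form; the sketch below computes it for two representation matrices of the same n examples. The function name is mine, and the kernel-based variant with non-linear kernels is not shown.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between representations X and Y.

    X : (n, p1), Y : (n, p2), rows are the same n examples.
    Invariant to orthogonal transforms and isotropic scaling, but not to
    arbitrary invertible linear maps.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```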