Session: Deep Learning Architectures
Invertible Residual Networks
Jens Behrmann · Will Grathwohl · Ricky T. Q. Chen · David Duvenaud · Joern-Henrik Jacobsen
We show that standard ResNet architectures can be made invertible, allowing the same model to be used for classification, density estimation, and generation. Typically, enforcing invertibility requires partitioning dimensions or restricting network architectures. In contrast, our approach only requires adding a simple normalization step during training, already available in standard frameworks. Invertible ResNets define a generative model which can be trained by maximum likelihood on unlabeled data. To compute likelihoods, we introduce a tractable approximation to the Jacobian log-determinant of a residual block. Our empirical evaluation shows that invertible ResNets perform competitively with both state-of-the-art image classifiers and flow-based generative models, something that has not been previously achieved with a single architecture.
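To make the two ingredients concrete, here is a minimal sketch (not the authors' code) of a spectrally normalized residual block together with a truncated power-series, Hutchinson-style estimate of the Jacobian log-determinant; the contraction factor, layer sizes, and number of series terms are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): spectral normalization keeps each
# linear map roughly 1-Lipschitz, so scaling the residual branch by c < 1
# makes x -> x + g(x) invertible; log|det(I + J_g)| is estimated from the
# power series sum_k (-1)^{k+1} tr(J_g^k) / k with Hutchinson trace probes.
import torch
import torch.nn as nn

class SpectralResBlock(nn.Module):
    def __init__(self, dim, hidden=64, c=0.9):
        super().__init__()
        self.c = c  # contraction factor < 1 (illustrative choice)
        self.net = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(dim, hidden)), nn.ELU(),
            nn.utils.spectral_norm(nn.Linear(hidden, dim)),
        )

    def forward(self, x):
        # invertible; the inverse can be found by fixed-point iteration x = y - c*net(x)
        return x + self.c * self.net(x)

def log_det_estimate(block, x, n_terms=10):
    """Per-sample estimate of log|det(I + J_g(x))| for the residual branch g."""
    x = x.clone().requires_grad_(True)
    g = block.c * block.net(x)
    v = torch.randn_like(x)          # Hutchinson probe vector
    w = v
    logdet = torch.zeros(x.shape[0])
    for k in range(1, n_terms + 1):
        # vector-Jacobian product: w <- w^T J_g, so (w * v).sum() ~ tr(J_g^k)
        w = torch.autograd.grad(g, x, grad_outputs=w, retain_graph=True)[0]
        logdet = logdet + (-1) ** (k + 1) * (w * v).sum(dim=1) / k
    return logdet

block = SpectralResBlock(dim=8)
print(log_det_estimate(block, torch.randn(4, 8)))
```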
NAS-Bench-101: Towards Reproducible Neural Architecture Search
Chris Ying · Aaron Klein · Eric Christiansen · Esteban Real · Kevin Murphy · Frank Hutter
Recent advances in neural architecture search (NAS) demand tremendous computational resources. This makes it difficult to reproduce experiments and raises a barrier to entry for researchers without access to large-scale computation. We aim to ameliorate these problems by introducing NAS-Bench-101, the first public architecture dataset for NAS research. To build it, we carefully constructed a compact---yet expressive---search space, exploiting graph isomorphisms to identify 423K unique architectures. Utilizing machine-years of computation, we trained them all with public code and compiled the results into a large table. This allows researchers to evaluate the quality of a proposed model in milliseconds using various precomputed metrics. NAS-Bench-101 presents a unique opportunity to study the entire NAS loss landscape from a data-driven perspective, which we illustrate with our analysis. We also demonstrate the dataset's application to benchmarking by comparing a range of popular architecture optimization algorithms on it.
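The intended usage pattern looks roughly like the sketch below; the module, constant, and record-key names follow the public nasbench repository (github.com/google-research/nasbench) and should be treated as assumptions to verify against that code.

```python
# Sketch of querying the precomputed table; the file name, op strings, and
# record keys below are taken from the public nasbench repository and may
# need to be checked against the released code and data.
from nasbench import api

nasbench = api.NASBench('/path/to/nasbench_only108.tfrecord')

# A cell is a small DAG: an upper-triangular adjacency matrix plus one op per node.
cell = api.ModelSpec(
    matrix=[[0, 1, 1, 0, 0],
            [0, 0, 0, 1, 0],
            [0, 0, 0, 1, 0],
            [0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0]],
    ops=['input', 'conv3x3-bn-relu', 'maxpool3x3', 'conv1x1-bn-relu', 'output'])

# The query returns precomputed training statistics in milliseconds
# (e.g. validation/test accuracy, training time) instead of training the model.
stats = nasbench.query(cell)
print(stats['validation_accuracy'], stats['training_time'])
```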
Approximated Oracle Filter Pruning for Destructive CNN Width Optimization
Xiaohan Ding · Guiguang Ding · Yuchen Guo · Jungong Han · Chenggang Yan
Designing and running Convolutional Neural Networks (CNNs) is never easy, because 1) the optimal number of filters at each layer of a given architecture is unknown, and 2) the computational intensity of CNNs impedes their deployment on computationally limited devices. The need for an automatic method to optimize the number of filters, i.e., the width of convolutional layers, brings us to Oracle Pruning, which is the most accurate filter pruning method but suffers from intolerable time complexity. To address this problem, we propose Approximated Oracle Filter Pruning (AOFP), a training-time filter pruning framework that is practical on very deep CNNs. With AOFP, we can prune an existing deep CNN with acceptable time cost, negligible accuracy drop and no heuristic knowledge, or re-design a model that achieves higher accuracy and faster inference.
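For context, the sketch below implements the slow oracle-style scoring that AOFP is designed to approximate, not AOFP itself: each filter of one layer is zeroed in turn and the resulting loss increase on a batch is recorded; the tiny network and batch are placeholders.

```python
# Sketch of oracle-style filter scoring (the slow baseline AOFP approximates):
# zero each output channel of one conv layer in turn and record the loss
# increase on a batch; channels with the smallest increase are pruning candidates.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
conv = net[0]
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))  # placeholder batch

def loss_with_channel_zeroed(channel=None):
    def hook(_module, _inputs, out):
        if channel is not None:
            out = out.clone()
            out[:, channel] = 0  # mask one filter's activations
        return out
    handle = conv.register_forward_hook(hook)
    with torch.no_grad():
        loss = F.cross_entropy(net(x), y).item()
    handle.remove()
    return loss

base = loss_with_channel_zeroed(None)
scores = [loss_with_channel_zeroed(c) - base for c in range(conv.out_channels)]
print(sorted(range(conv.out_channels), key=scores.__getitem__)[:4])  # least important filters
```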
LegoNet: Efficient Convolutional Neural Networks with Lego Filters
Zhaohui Yang · Yunhe Wang · Chuanjian Liu · Hanting Chen · Chunjing Xu · Boxin Shi · Chao Xu · Chang Xu
This paper aims to build efficient convolutional neural networks using a set of Lego filters. Many successful building blocks, e.g., inception and residual modules, have been designed to push the state of the art of CNNs on visual recognition tasks. Beyond these high-level modules, we suggest that an ordinary filter in a neural network can be upgraded to a sophisticated module as well. Filter modules are established by assembling a shared set of Lego filters that are often of much lower dimensions. Weights in the Lego filters and the binary masks used to stack them into filter modules can be optimized simultaneously in an end-to-end manner as usual. Inspired by network engineering, we develop a split-transform-merge strategy for efficient convolution by exploiting intermediate Lego feature maps. We discuss theoretically the compression and acceleration achieved by Lego Networks built from the proposed Lego filters. Experimental results on benchmark datasets and deep models demonstrate the advantages of the proposed Lego filters and their potential real-world applications on mobile devices.
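A minimal sketch of the weight-assembly idea, under simplifying assumptions: full convolution filters are stacked along the input-channel axis from a small shared bank of low-dimensional Lego filters. Here the selection matrix is fixed and random, whereas in the paper the binary masks are learned end-to-end together with the Lego filters, and the split-transform-merge computation is not reproduced.

```python
# Sketch only: assemble conv filters from a shared bank of "Lego" filters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LegoConv2d(nn.Module):
    def __init__(self, c_in, c_out, k, n_lego=4, groups=4):
        super().__init__()
        assert c_in % groups == 0
        self.groups, self.c_out, self.k = groups, c_out, k
        # shared bank: n_lego small filters covering c_in // groups channels each
        self.lego = nn.Parameter(torch.randn(n_lego, c_in // groups, k, k) * 0.1)
        # which Lego filter fills each (output filter, input-channel group) slot;
        # fixed here, learned as binary masks in the paper
        self.register_buffer("select", torch.randint(0, n_lego, (c_out, groups)))

    def forward(self, x):
        w = self.lego[self.select]                     # (c_out, groups, c_in//g, k, k)
        w = w.reshape(self.c_out, -1, self.k, self.k)  # stack groups along input channels
        return F.conv2d(x, w, padding=self.k // 2)

layer = LegoConv2d(c_in=16, c_out=32, k=3)
print(layer(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 32, 8, 8])
```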
Sorting Out Lipschitz Function Approximation
Cem Anil · James Lucas · Roger Grosse
Training neural networks subject to a Lipschitz constraint is useful for generalization bounds, provable adversarial robustness, interpretable gradients, and Wasserstein distance estimation. By the composition property of Lipschitz functions, it suffices to ensure that each individual affine transformation or nonlinear activation function is 1-Lipschitz. The challenge is to do this while maintaining the expressive power. We identify a necessary property for such an architecture: each of the layers must preserve the gradient norm during backpropagation. Based on this, we propose to combine a gradient norm preserving activation function, GroupSort, with norm-constrained weight matrices. We show that norm-constrained GroupSort architectures are universal Lipschitz function approximators. Empirically, we show that norm-constrained GroupSort networks achieve tighter estimates of Wasserstein distance than their ReLU counterparts and can achieve provable adversarial robustness guarantees with little cost to accuracy.
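The activation itself is easy to state; below is a minimal sketch of GroupSort on a batch of feature vectors (with group size 2 it reduces to the MaxMin activation). Since sorting only permutes coordinates, it preserves the gradient norm during backpropagation, unlike ReLU, which zeroes some coordinates.

```python
# Minimal sketch of the GroupSort activation: sort pre-activations within groups.
import torch

def group_sort(x, group_size=2):
    # x: (batch, features); features must be divisible by group_size
    b, d = x.shape
    grouped = x.view(b, d // group_size, group_size)
    return grouped.sort(dim=-1).values.view(b, d)

x = torch.tensor([[3.0, 1.0, -2.0, 5.0]])
print(group_sort(x))  # tensor([[ 1.,  3., -2.,  5.]]) -- each pair becomes (min, max)
```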
Graph Element Networks: adaptive, structured computation and memory
Ferran Alet · Adarsh Keshav Jeewajee · Maria Bauza Villalonga · Alberto Rodriguez · Tomas Lozano-Perez · Leslie Kaelbling
We explore the use of graph-structured neural networks (GNNs) to model spatial processes in which there is no a priori graphical structure. Similar to finite element analysis, we assign nodes of a GNN to spatial locations and use a computational process defined on the graph to model the relationship between an initial function defined over a space and a resulting function. The encoding of inputs to node states, the decoding of node states to outputs, and the mappings defining the GNN are learned from a training set consisting of data from multiple function pairs. The locations of the nodes in space, as well as their connectivity, can be adjusted during training. This graph-based representational strategy allows the learned input-output relationship to generalize over the size and even the topology of the underlying space. We demonstrate this method on a traditional PDE problem, a physical prediction problem from robotics, and a problem of learning to predict scene images from novel viewpoints.
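A toy sketch of the overall pipeline, with many simplifying assumptions (fixed node positions, a k-NN graph, nearest-node encoding and decoding, and arbitrary layer sizes): observations are encoded into nearby node states, a few rounds of message passing are run on the graph, and queries are decoded from the states of nearby nodes.

```python
# Toy sketch only, not the authors' architecture.
import torch
import torch.nn as nn

class TinyGEN(nn.Module):
    def __init__(self, n_nodes=16, d=32):
        super().__init__()
        self.pos = torch.rand(n_nodes, 2)               # node locations in the unit square
        self.enc = nn.Linear(1 + 2, d)                  # observation value + its location
        self.msg = nn.Linear(2 * d, d)                  # message function on the graph
        self.dec = nn.Linear(d + 2, 1)                  # node state + query location
        dist = torch.cdist(self.pos, self.pos)
        self.nbrs = dist.topk(4, largest=False).indices # k-NN graph (includes self)

    def forward(self, obs_xy, obs_val, query_xy):
        # encode each observation into its nearest node
        h = torch.zeros(len(self.pos), self.enc.out_features)
        nearest = torch.cdist(obs_xy, self.pos).argmin(dim=1)
        enc = torch.relu(self.enc(torch.cat([obs_val, obs_xy], dim=-1)))
        h = h.index_add(0, nearest, enc)
        # a few rounds of message passing over the node graph
        for _ in range(3):
            m = torch.relu(self.msg(torch.cat([h[self.nbrs].mean(1), h], dim=-1)))
            h = h + m
        # decode the predicted value at each query point from its nearest node
        q_nearest = torch.cdist(query_xy, self.pos).argmin(dim=1)
        return self.dec(torch.cat([h[q_nearest], query_xy], dim=-1))

gen = TinyGEN()
print(gen(torch.rand(10, 2), torch.rand(10, 1), torch.rand(5, 2)).shape)  # (5, 1)
```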
Training CNNs with Selective Allocation of Channels
Jongheon Jeong · Jinwoo Shin
Recent progress in deep convolutional neural networks (CNNs) has enabled a simple paradigm of architecture design: larger models typically achieve better accuracy. Consequently, in modern CNN architectures, it becomes increasingly important to design models that generalize well under certain resource constraints, e.g., the number of parameters. In this paper, we propose a simple way to improve the capacity of any CNN model with large-scale features, without adding more parameters. In particular, we modify a standard convolutional layer to have a new functionality of channel-selectivity, so that the layer is trained to select important channels and re-distribute its parameters toward them. Our experimental results across various CNN architectures and datasets demonstrate that the proposed convolutional layer reaches new optima that generalize better than the baseline through more efficient use of its parameters.
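As a point of reference only, the sketch below is a generic channel-gating layer with an L1 sparsity penalty, not the paper's channel-selective layer (which additionally re-distributes the parameters of de-selected channels); it illustrates the basic idea of letting a convolutional layer learn which channels matter.

```python
# Generic channel-gating sketch; named as a stand-in, not the paper's method.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2)
        self.gate = nn.Parameter(torch.ones(c_out))  # one learnable gate per channel

    def forward(self, x):
        return self.conv(x) * self.gate.view(1, -1, 1, 1)

    def sparsity_loss(self):
        return self.gate.abs().sum()  # add (scaled) to the task loss to encourage selection

layer = GatedConv2d(3, 16)
y = layer(torch.randn(2, 3, 32, 32))
reg = 1e-4 * layer.sparsity_loss()
```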
Equivariant Transformer Networks
Kai Sheng Tai · Peter Bailis · Gregory Valiant
How can prior knowledge about the transformation invariances of a domain be incorporated into the architecture of a neural network? We propose Equivariant Transformers (ETs), a family of differentiable image-to-image mappings that improve the robustness of models towards pre-defined continuous transformation groups. Through the use of specially derived canonical coordinate systems, ETs incorporate functions that are equivariant by construction with respect to these transformations. We show empirically that ETs can be flexibly composed to improve model robustness towards more complicated transformation groups involving several parameters. On a real-world image classification task, ETs improve the sample efficiency of ResNet classifiers, achieving relative improvements in error rate of up to 15% in the limited-data regime while increasing model parameter count by less than 1%.
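The canonical-coordinate idea can be illustrated with the classical log-polar transform: resampling an image onto a log-polar grid turns rotations and scalings about the image centre into translations of the resampled image, so a translation-equivariant network applied to it becomes equivariant to rotation and scale. The sketch below shows only this illustration, with an assumed output resolution and minimum radius, not the ETs architecture itself.

```python
# Log-polar resampling as an example of a canonical coordinate system.
import math
import torch
import torch.nn.functional as F

def log_polar_resample(img, out_hw=(64, 64), r_min=0.05):
    # img: (B, C, H, W); grid coordinates are normalised to [-1, 1] for grid_sample
    b = img.shape[0]
    n_r, n_t = out_hw
    log_r = torch.linspace(math.log(r_min), 0.0, n_r)       # log-radius axis
    theta = torch.linspace(0.0, 2 * math.pi, n_t + 1)[:-1]  # angle axis in [0, 2*pi)
    r = torch.exp(log_r)[:, None]                           # radii in [r_min, 1]
    x = r * torch.cos(theta)[None, :]
    y = r * torch.sin(theta)[None, :]
    grid = torch.stack([x, y], dim=-1).expand(b, n_r, n_t, 2)
    # rotation -> shift along the angle axis; scaling -> shift along the log-radius axis
    return F.grid_sample(img, grid, align_corners=False)

img = torch.randn(1, 3, 128, 128)
print(log_polar_resample(img).shape)  # torch.Size([1, 3, 64, 64])
```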
Overcoming Multi-model Forgetting
Yassine Benyahia · Kaicheng Yu · Kamil Bennani-Smires · Martin Jaggi · Anthony C. Davison · Mathieu Salzmann · Claudiu Musat
We identify a phenomenon, which we refer to as multi-model forgetting, that occurs when sequentially training multiple deep networks with partially-shared parameters; the performance of previously-trained models degrades as one optimizes a subsequent one, due to the overwriting of shared parameters. To overcome this, we introduce a statistically-justified weight plasticity loss that regularizes the learning of a model's shared parameters according to their importance for the previous models, and demonstrate its effectiveness when training two models sequentially and for neural architecture search. Adding weight plasticity in neural architecture search preserves the best models to the end of the search and yields improved results in both natural language processing and computer vision tasks.
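In spirit, the weight plasticity loss is an importance-weighted quadratic penalty that anchors the shared parameters near the values learned by the previous model; the paper derives the specific statistical weighting, while the sketch below uses a placeholder importance (for example, squared gradients accumulated while training the earlier model).

```python
# Generic importance-weighted quadratic penalty, a simplified stand-in for the
# weight-plasticity loss; `importance` is a placeholder for the paper's weighting.
import torch

def plasticity_penalty(shared_params, anchor_params, importance, strength=1.0):
    penalty = 0.0
    for p, p_old, w in zip(shared_params, anchor_params, importance):
        penalty = penalty + (w * (p - p_old) ** 2).sum()
    return strength * penalty

# usage sketch: loss = task_loss(model_B) + plasticity_penalty(shared, snapshot_of_A, importance_A)
```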
Bayesian Nonparametric Federated Learning of Neural Networks
Mikhail Yurochkin · Mayank Agarwal · Soumya Ghosh · Kristjan Greenewald · Nghia Hoang · Yasaman Khazaeni
In federated learning problems, data is scattered across different servers and exchanging or pooling it is often impractical or prohibited. We develop a Bayesian nonparametric framework for federated learning with neural networks. Each data server is assumed to provide local neural network weights, which are modeled through our framework. We then develop an inference approach that allows us to synthesize a more expressive global network without additional supervision or data pooling, and with as little as a single communication round. Finally, we demonstrate the efficacy of our approach on federated learning problems simulated from two popular image classification datasets.
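The core "match, then average" step can be pictured with a much-simplified stand-in: hidden units of two locally trained single-layer networks are aligned by solving an assignment problem on their incoming weights, and the matched units are averaged. The paper's Beta-Bernoulli-process model additionally lets the global network grow beyond the local sizes and handles more than two servers; both are omitted here.

```python
# Simplified stand-in for permutation-invariant aggregation of local networks.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_average(W_a, W_b):
    # W_a, W_b: (hidden, input) incoming weights from two data servers
    cost = -W_a @ W_b.T                       # negative similarity as assignment cost
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching of hidden units
    return 0.5 * (W_a[rows] + W_b[cols])      # average matched units into a global layer

W_global = match_and_average(np.random.randn(8, 4), np.random.randn(8, 4))
print(W_global.shape)  # (8, 4)
```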