Track: Deep Learning Algorithms

Wed 12 June 16:00 - 16:20 PDT

How does Disagreement Help Generalization against Label Corruption?

Xingrui Yu · Bo Han · Jiangchao Yao · Gang Niu · Ivor Tsang · Masashi Sugiyama

Learning with noisy labels is one of the hottest problems in weakly-supervised learning. Based on memorization effects of deep neural networks, training on small-loss samples becomes very promising for handling noisy labels. This fosters the state-of-the-art approach "Co-teaching" that cross-trains two deep neural networks using small-loss trick. However, with the increase of epochs, two networks will converge to a consensus gradually and Co-teaching reduces to the self-training MentorNet. To tackle this issue, we propose a robust learning paradigm called Co-teaching+, which bridges the "Update by Disagreement" strategy with the original Co-teaching. First, two networks predict all data, and feed forward prediction disagreement data only. Then, among such disagreement data, each network selects its small-loss data, but back propagates the small-loss data by its peer network and updates its own parameters. Empirical results on noisy benchmark datasets demonstrate that Co-teaching+ is much superior to many state-of-the-art methods in the robustness of trained models.

Wed 12 June 16:20 - 16:25 PDT

EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis

Chaoqi Wang · Roger Grosse · Sanja Fidler · Guodong Zhang

Reducing the test time resource requirements of a neural network while preserving test accuracy is crucial for running inference on low-power devices. To achieve this goal, we introduce a novel network reparameterization based on the Kronecker-factored eigenbasis (KFE), and then apply Hessian-based structured pruning methods in this basis. As opposed to existing Hessian-based pruning algorithms which do pruning in parameter coordinates, our method works in the KFE where different weights are approximately independent, enabling accurate pruning and fast computation.We demonstrate empirically the effectiveness of the proposed method through extensive experiments. In particular, we highlight that the improvements are especially significant for more challenging datasets and networks. With negligible loss of accuracy, a iterative-pruning version gives a 10x reduction in model size and a 8x reduction in FLOPs on wide ResNet32.

Wed 12 June 16:25 - 16:30 PDT

Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment

Chen Huang · Shuangfei Zhai · Walter Talbott · Miguel Angel Bautista Martin · Shih-Yu Sun · Carlos Guestrin · Joshua M Susskind

In most machine learning training paradigms a fixed, often handcrafted, loss function is assumed to be a good proxy for an underlying evaluation metric. In this work we assess this assumption by meta-learning an adaptive loss function to directly optimize the evaluation metric. We propose a sample efficient reinforcement learn- ing approach for adapting the loss dynamically during training. We empirically show how this formulation improves performance by simultaneously optimizing the evaluation metric and smoothing the loss land- scape. We verify our method in metric learning and classification scenarios, showing consider- able improvements over the state-of-the-art on a diverse set of tasks. Importantly, our method is applicable to a wide range of loss functions and evaluation metrics. Furthermore, the learned policies are transferable across tasks and data, demonstrating the versatility of the method.

Wed 12 June 16:30 - 16:35 PDT

Deep Compressed Sensing

Yan Wu · Mihaela Rosca · Timothy Lillicrap

Compressed sensing (CS) provides an elegant framework for recovering sparse signals from compressed measurements. For example, CS can exploit the structure of natural images and recover an image from only a few random measurements. CS is highly flexible and data efficient, but its application has been restricted by the strong assumption of sparsity and costly optimisation process. A recent approach that combines CS with neural network generators has removed the constraint of sparsity, but reconstruction remains slow. Here we propose a novel framework that significantly improves both the performance and speed of signal recovery by jointly training a generator and the optimisation process for reconstruction via meta-learning. We explore training the measurements with different objectives, and derive a family of models based on minimising measurement errors. We show that Generative Adversarial Nets (GANs) can be viewed as a special case in this family of models. Borrowing insights from the CS perspective, we develop a novel way of stabilising GAN training using gradient information from the discriminator.

Wed 12 June 16:35 - 16:40 PDT

Differentiable Dynamic Normalization for Learning Deep Representation

Ping Luo · Peng Zhanglin · Shao Wenqi · Zhang ruimao · Ren jiamin · Wu lingyun

This work presents Dynamic Normalization (DN), which is able to learn arbitrary normalization operations for different convolutional layers in a deep ConvNet. Unlike existing normalization approaches that predefined computations of the statistics (mean and variance), DN learns to estimate them. DN has several appealing benefits. First, it adapts to various networks, tasks, and batch sizes. Second, it can be easily implemented and trained in a differentiable end-to-end manner with merely small number of parameters. Third, its matrix formulation represents a wide range of normalization methods, shedding light on analyzing them theoretically. Extensive studies show that DN outperforms its counterparts in CIFAR10 and ImageNet.

Wed 12 June 16:40 - 17:00 PDT

Toward Understanding the Importance of Noise in Training Neural Networks

Mo Zhou · Tianyi Liu · Yan Li · Dachao Lin · Enlu Zhou · Tuo Zhao

Numerous empirical evidence has corroborated that the noise plays a crucial rule in effective and efficient training of deep neural networks. The theory behind, however, is still largely unknown. This paper studies this fundamental problem through training a simple two-layer convolutional neural network model. Although training such a network requires to solve a non-convex optimization problem with a spurious local optimum and a global optimum, we prove that a perturbed gradient descent algorithm in conjunction with noise annealing is guaranteed to converge to a global optimum in polynomial time with arbitrary initialization. This implies that the noise enables the algorithm to efficiently escape from the spurious local optimum. Numerical experiments are provided to support our theory.

Wed 12 June 17:00 - 17:05 PDT

Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group

Mario Lezcano Casado · David Martínez-Rubio

We introduce a novel approach to perform first-order optimization with orthogonal and unitary constraints. This approach is based on a parametrization stemming from Lie group theory through the exponential map. The parametrization transforms the constrained optimization problem into an unconstrained one over a Euclidean space, for which common first-order optimization methods can be used. The theoretical results presented are general enough to cover the special orthogonal group, the unitary group and, in general, any connected compact Lie group. We discuss how this and other parametrizations can be computed efficiently through an implementation trick, making numerically complex parametrizations usable at a negligible runtime cost in neural networks. In particular, we apply our results to RNNs with orthogonal recurrent weights, yielding a new architecture called expRNN. We demonstrate how our method constitutes a more robust approach to optimization with orthogonal constraints, showing faster, accurate, and more stable convergence in several tasks designed to test RNNs.

Wed 12 June 17:05 - 17:10 PDT

Breaking Inter-Layer Co-Adaptation by Classifier Anonymization

Ikuro Sato · Kohta Ishikawa · Guoqing Liu · Masayuki Tanaka

This study addresses an issue of co-adaptation between a feature extractor and a classifier in a neural network. A na\"ive joint optimization of a feature extractor and a classifier often brings situations in which an excessively complex feature distribution adapted to a very specific classifier degrades the test performance. We introduce a method called Feature-extractor Optimization through Classifier Anonymization (FOCA), which is designed to avoid an explicit co-adaptation between a feature extractor and a particular classifier by using many randomly-generated, weak classifiers during optimization. We put forth a mathematical proposition that states the FOCA features form a point-like distribution within the same class in a class-separable fashion under special conditions. Real-data experiments under more general conditions provide supportive evidences.

Wed 12 June 17:10 - 17:15 PDT

Understanding the Impact of Entropy on Policy Optimization

Zafarali Ahmed · Nicolas Le Roux · Mohammad Norouzi · Dale Schuurmans

Entropy regularization is commonly used to improve policy optimization in reinforcement learning. It is believed to help with exploration by encouraging the selection of more stochastic policies. In this work, we analyze this claim using new visualizations of the optimization landscape based on randomly perturbing the loss function. We first show that even with access to the exact gradient, policy optimization is difficult due to the geometry of the objective function. Then, we qualitatively show that in some environments, a policy with higher entropy can make the optimization landscape smoother, thereby connecting local optima and enabling the use of larger learning rates. This manuscript also presents new tools for understanding the optimization landscape, shows that policy entropy serves as a regularizer, and highlights the challenge of designing general-purpose policy optimization algorithms.

Wed 12 June 17:15 - 17:20 PDT

Probability Functional Descent: A Unifying Perspective on GANs, Variational Inference, and Reinforcement Learning

Casey Chu · Jose Blanchet · Peter Glynn

The goal of this paper is to provide a unifying view of a wide range of problems of interest in machine learning by framing them as the minimization of functionals defined on the space of probability measures. In particular, we show that generative adversarial networks, variational inference, and actor-critic methods in reinforcement learning can all be seen through the lens of our framework. We then discuss a generic optimization algorithm for our formulation, called probability functional descent (PFD), and show how this algorithm recovers existing methods developed independently in the settings mentioned earlier.

Main Navigation

Session

Deep Learning Algorithms