Track: Deep Learning (Bayesian) 3

Thu 12 July 8:00 - 8:20 PDT

Neural Autoregressive Flows

Chin-Wei Huang · David Krueger · Alexandre Lacoste · Aaron Courville

Normalizing flows and autoregressive models have been successfully combined to produce state-of-the-art results in density estimation, via Masked Autoregressive Flows (MAF) (Papamakarios et al., 2017), and to accelerate state-of-the-art WaveNet-based speech synthesis to 20x faster than real-time (Oord et al., 2017), via Inverse Autoregressive Flows (IAF) (Kingma et al., 2016). We unify and generalize these approaches, replacing the (conditionally) affine univariate transformations of MAF/IAF with a more general class of invertible univariate transformations expressed as monotonic neural networks. We demonstrate that the proposed neural autoregressive flows (NAF) are universal approximators for continuous probability distributions, and their greater expressivity allows them to better capture multimodal target distributions. Experimentally, NAF yields state-of-the-art performance on a suite of density estimation tasks and outperforms IAF in variational autoencoders trained on binarized MNIST.
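As a rough illustration of what "invertible univariate transformations expressed as monotonic neural networks" can mean, the following is a minimal sketch, not the authors' implementation. It assumes a one-hidden-layer sigmoidal form with positive slopes and convex-combination output weights. The names `monotonic_transform`, `raw_a`, `b`, and `raw_w` are hypothetical. In an autoregressive flow these parameters would be produced per dimension by a conditioner network from the preceding dimensions, whereas here they are fixed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softplus(z):
    # Maps any real number to a strictly positive one.
    return math.log1p(math.exp(z))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def monotonic_transform(x, raw_a, b, raw_w):
    """Strictly increasing map from the real line to (0, 1).

    Returns the output y and log(dy/dx), the term needed by the
    change-of-variables formula in a normalizing flow.
    """
    a = [softplus(r) for r in raw_a]   # positive slopes (assumed constraint)
    w = softmax(raw_w)                 # positive weights that sum to one
    s = [sigmoid(ai * x + bi) for ai, bi in zip(a, b)]
    y = sum(wi * si for wi, si in zip(w, s))
    # Every term is positive, so the derivative is positive and y is invertible in x.
    dy_dx = sum(wi * ai * si * (1.0 - si) for wi, ai, si in zip(w, a, s))
    return y, math.log(dy_dx)
```

With several hidden units the map can have several steep regions, which is one way a single transformation could spread mass across multiple modes; an affine map, by contrast, can only shift and scale.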

Thu 12 July 8:20 - 8:30 PDT

Distilling the Posterior in Bayesian Neural Networks

Kuan-Chieh Wang · Paul Vicol · James Lucas · Li Gu · Roger Grosse · Richard Zemel

In many applications of deep learning, it is crucial to capture model and prediction uncertainty. Unlike classic neural networks (NN), Bayesian neural networks (BNN) allow us to reason about uncertainty in a more principled way. Stochastic Gradient Langevin Dynamics (SGLD) enables learning a BNN with only simple modifications to the standard optimization framework (SGD). Instead of a single point estimate of the model, SGLD produces samples from the BNN posterior. However, SGLD and its extensions require storing the entire history of model parameters, a potentially prohibitive cost (especially for large neural networks). We propose a framework, Adversarial Posterior Distillation, to distill the SGLD samples using Generative Adversarial Networks (GAN). At test time, samples are generated by the GAN. We show that this distillation framework incurs no loss in performance on recent BNN applications including anomaly detection, active learning, and defense against attacks. By construction, our framework distills not only the Bayesian predictive distribution but also the posterior itself. This allows users to compute quantities such as the approximate model variance, which is useful in downstream tasks.
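The "simple modifications" to SGD that SGLD makes can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not code from the paper: the model is a one-parameter Gaussian mean with unit observation noise and a Gaussian prior, the step size is held constant (which introduces some bias), and the function name `sgld_samples` and all of its parameters are hypothetical.

```python
import math
import random

def sgld_samples(data, num_steps=5000, batch_size=10, step_size=1e-3,
                 prior_std=10.0, seed=0):
    """Draw approximate posterior samples of a Gaussian mean with SGLD.

    Assumed model: y_i ~ N(theta, 1), theta ~ N(0, prior_std^2).
    """
    rng = random.Random(seed)
    n_total = len(data)
    theta = 0.0
    samples = []
    for _ in range(num_steps):
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior.
        grad_prior = -theta / prior_std ** 2
        # Minibatch estimate of the log-likelihood gradient, rescaled to the full data set.
        grad_lik = (n_total / batch_size) * sum(y - theta for y in batch)
        # SGD-style step plus Gaussian noise whose variance equals the step size.
        noise = rng.gauss(0.0, math.sqrt(step_size))
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples.append(theta)  # every iterate is kept as a posterior sample
    return samples
```

The `samples` list grows by one full parameter vector per step. For a single scalar this is trivial, but for a large network it is the storage cost the abstract refers to.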

Thu 12 July 8:30 - 8:40 PDT

Bayesian Uncertainty Estimation for Batch Normalized Deep Networks

Mattias Teye · Hossein Azizpour · Kevin Smith

We show that training a deep network using batch normalization is equivalent to approximate inference in Bayesian models. We further demonstrate that this finding allows us to make meaningful estimates of the model uncertainty using conventional architectures, without modifications to the network or the training procedure. Our approach is thoroughly validated by measuring the quality of uncertainty in a series of empirical experiments on different tasks. It outperforms baselines with strong statistical significance, and displays competitive performance with recent Bayesian approaches.

Thu 12 July 8:40 - 8:50 PDT

Noisy Natural Gradient as Variational Inference

Guodong Zhang · Shengyang Sun · David Duvenaud · Roger Grosse

Variational Bayesian neural nets combine the flexibility of deep learning with Bayesian uncertainty estimation. Unfortunately, there is a tradeoff between cheap but simple variational families (e.g., fully factorized) and expensive, complicated inference procedures. We show that natural gradient ascent with adaptive weight noise implicitly fits a variational posterior to maximize the evidence lower bound (ELBO). This insight allows us to train full-covariance, fully factorized, or matrix-variate Gaussian variational posteriors using noisy versions of natural gradient, Adam, and K-FAC, respectively, making it possible to scale up to modern-size ConvNets. On standard regression benchmarks, our noisy K-FAC algorithm makes better predictions and matches Hamiltonian Monte Carlo's predictive variances better than existing methods. Its improved uncertainty estimates lead to more efficient exploration in active learning and in intrinsic motivation for reinforcement learning.

Thu 12 July 8:50 - 9:00 PDT

Structured Variational Learning of Bayesian Neural Networks with Horseshoe Priors

Soumya Ghosh · Jiayu Yao · Finale Doshi-Velez

Bayesian Neural Networks (BNNs) have recently received increasing attention for their ability to provide well-calibrated posterior uncertainties. However, model selection (even choosing the number of nodes) remains an open question. Recent work has proposed the use of a horseshoe prior over node pre-activations of a Bayesian neural network, which effectively turns off nodes that do not help explain the data. In this work, we propose several modeling and inference advances that consistently improve the compactness of the learned model while maintaining predictive performance, especially in smaller-sample settings, including reinforcement learning.
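To see why a horseshoe prior can "turn off" nodes, it may help to sample from the standard horseshoe construction. The sketch below is an illustration only and does not reproduce the paper's model. It assumes one half-Cauchy local scale per node and one shared half-Cauchy global scale, and the names `half_cauchy`, `sample_node_scales`, and `shrinkage` are hypothetical.

```python
import math
import random

def half_cauchy(rng, scale=1.0):
    # Inverse-CDF style draw: the absolute value of a Cauchy variate.
    u = rng.random()
    return scale * abs(math.tan(math.pi * (u - 0.5)))

def sample_node_scales(rng, num_nodes, global_scale=1.0):
    """Sample a shared global scale tau and one local scale per node."""
    tau = half_cauchy(rng, global_scale)
    lambdas = [half_cauchy(rng) for _ in range(num_nodes)]
    return tau, lambdas

def shrinkage(lam):
    # kappa near 1: the node is shrunk toward zero; kappa near 0: left almost untouched.
    return 1.0 / (1.0 + lam * lam)
```

With unit-scale half-Cauchy local scales, the shrinkage factor follows a Beta(1/2, 1/2) distribution, so most of its mass sits near 0 and near 1. Nodes therefore tend to be either kept or strongly suppressed, which gives the prior its name.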
