

Session

Bayesian Deep Learning


Thu 13 June 16:00 - 16:20 PDT

Probabilistic Neural Symbolic Models for Interpretable Visual Question Answering

Shanmukha Ramakrishna Vedantam · Karan Desai · Stefan Lee · Marcus Rohrbach · Dhruv Batra · Devi Parikh

We propose a new class of probabilistic neural-symbolic models that have symbolic functional programs as a latent stochastic variable. Instantiated in the context of visual question answering, our probabilistic formulation offers two key conceptual advantages over prior neural-symbolic models for VQA. First, the programs generated by our model are more understandable while requiring fewer teaching examples. Second, we show that one can pose counterfactual scenarios to the model to probe its beliefs about the questions or programs that could lead to a specified answer given an image. Our results on a dataset of compositional questions about SHAPES verify our hypotheses, showing that the model achieves better program (and answer) prediction accuracy even in the low-data regime, and allows one to probe the coherence and consistency of its reasoning.
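A toy sketch (not the paper's model) of what treating the program as a discrete latent variable buys: marginalizing over candidate programs gives the answer distribution, and Bayes' rule gives the counterfactual probe "which programs could lead to this answer?". All probabilities below are made up for illustration.

```python
import numpy as np

# Illustrative quantities: a prior p(z|q) over 3 candidate programs given the
# question, and a likelihood p(a|x,z) from executing each program on the image.
p_z_given_q = np.array([0.7, 0.2, 0.1])
p_a_given_xz = np.array([[0.9, 0.1],    # answer distribution if program 0 runs
                         [0.2, 0.8],    # ... program 1
                         [0.5, 0.5]])   # ... program 2

# Marginal answer distribution: p(a|x,q) = sum_z p(z|q) p(a|x,z)
p_a = p_z_given_q @ p_a_given_xz

# Counterfactual probe: p(z | a=1, x, q) ∝ p(a=1|x,z) p(z|q)
post = p_a_given_xz[:, 1] * p_z_given_q
post /= post.sum()
print(p_a, post)
```

Here the posterior concentrates on program 1, even though the prior favored program 0, because program 1 best explains the specified answer.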

Thu 13 June 16:20 - 16:25 PDT

Nonparametric Bayesian Deep Networks with Local Competition

Konstantinos Panousis · Sotirios Chatzis · Sergios Theodoridis

The aim of this work is to enable inference of deep networks that retain high accuracy for the least possible model complexity, with the latter deduced from the data during inference. To this end, we revisit deep networks that comprise competing linear units, as opposed to nonlinear units that do not entail any form of (local) competition. In this context, our main technical innovation consists of an inferential setup that leverages solid arguments from Bayesian nonparametrics. We infer both the needed set of connections, or locally competing sets of units, and the required floating-point precision for storing the network parameters. Specifically, we introduce auxiliary discrete latent variables representing which initial network components are actually needed for modeling the data at hand, and perform Bayesian inference over them by imposing appropriate stick-breaking priors. As we experimentally show using benchmark datasets, our approach yields networks with a smaller computational footprint than the state of the art, with no compromise in predictive accuracy.
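A minimal sketch of the stick-breaking construction the abstract alludes to, under illustrative choices of concentration and truncation level: each weight can gate a network component, and components with vanishing weight are effectively pruned, which is how complexity is deduced from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K = 2.0, 10                     # concentration, truncation level (illustrative)
v = rng.beta(1.0, alpha, size=K)       # stick-breaking fractions v_k ~ Beta(1, alpha)
# pi_k = v_k * prod_{l<k} (1 - v_l): break off a fraction of the remaining stick
pi = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
print(pi, pi.sum())
```

The weights decay stochastically, so only a data-supported number of components retains non-negligible mass.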

Thu 13 June 16:25 - 16:30 PDT

Good Initializations of Variational Bayes for Deep Models

Simone Rossi · Pietro Michiardi · Maurizio Filippone

Stochastic variational inference is an established way to carry out approximate Bayesian inference for deep models. While there have been effective proposals for good initializations for loss minimization in deep learning, far less attention has been devoted to the issue of initialization of stochastic variational inference. We address this by proposing a novel layer-wise initialization strategy based on Bayesian linear models. The proposed method is extensively validated on regression and classification tasks, including Bayesian DeepNets and ConvNets, showing faster and better convergence compared to alternatives inspired by the literature on initializations for loss minimization.
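A rough sketch of the layer-wise idea, with illustrative shapes and hyperparameters: fit a Bayesian linear model on a layer's inputs and targets, then use the posterior mean and marginal variances to initialize the variational distribution q(W) for that layer.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))               # layer inputs (e.g. previous activations)
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

alpha, beta = 1.0, 100.0                    # prior precision, noise precision (illustrative)
S_inv = alpha * np.eye(5) + beta * X.T @ X  # posterior precision of the linear model
mu = np.linalg.solve(S_inv, beta * X.T @ y) # posterior mean  -> init for q's mean
var = np.diag(np.linalg.inv(S_inv))         # marginal variances -> init for q's scales
print(mu, var)
```

This gives a data-informed starting point for stochastic variational inference rather than a generic random one.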

Thu 13 June 16:30 - 16:35 PDT

Dropout as a Structured Shrinkage Prior

Eric Nalisnick · Jose Miguel Hernandez-Lobato · Padhraic Smyth

Dropout regularization of deep neural networks has been a mysterious yet effective tool to prevent overfitting. Explanations for its success range from the prevention of "co-adapted" weights to it being a form of cheap Bayesian inference. We propose a novel framework for understanding multiplicative noise in neural networks, considering continuous distributions as well as Bernoulli noise (i.e. dropout). We show that multiplicative noise induces structured shrinkage priors on a network's weights. We derive the equivalence through reparametrization properties of scale mixtures and without invoking any approximations. Given the equivalence, we then show that dropout's Monte Carlo training objective approximates marginal MAP estimation. We extend this framework to ResNets, terming the prior "automatic depth determination" as it is the natural analog of "automatic relevance determination" for network depth. Lastly, we investigate two inference strategies that improve upon the aforementioned MAP approximation in regression benchmarks.
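A tiny check of the reparametrization this equivalence rests on: applying multiplicative noise z to a layer's input is algebraically identical to scaling the corresponding weight rows by z, i.e. a structured scale mixture over rows of W. The shapes and the Gamma noise distribution are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=3)              # one input vector
W = rng.normal(size=(3, 2))         # layer weights
z = rng.gamma(2.0, 0.5, size=3)     # continuous multiplicative noise (could be Bernoulli)

# Noise on activations == noise absorbed into the weight prior (exact identity):
assert np.allclose((z * x) @ W, x @ (np.diag(z) @ W))
print("noise on inputs == scale mixture on weight rows")
```

Because the identity is exact, no approximation is needed to move the noise from the activations into the prior on the weights.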

Thu 13 June 16:35 - 16:40 PDT

ARSM: Augment-REINFORCE-Swap-Merge Estimator for Gradient Backpropagation Through Categorical Variables

Mingzhang Yin · Yuguang Yue · Mingyuan Zhou

To address the challenge of backpropagating the gradient through categorical variables, we propose the augment-REINFORCE-swap-merge (ARSM) gradient estimator that is unbiased and has low variance. ARSM first uses variable augmentation, REINFORCE, and Rao-Blackwellization to re-express the gradient as an expectation under the Dirichlet distribution, then uses variable swapping to construct differently expressed but equivalent expectations, and finally shares common random numbers between these expectations to achieve significant variance reduction. Experimental results show that ARSM closely matches the performance of the true gradient for optimization in univariate settings; outperforms existing estimators by a large margin when applied to categorical variational auto-encoders; and provides a "try-and-see self-critic" variance reduction method for discrete-action policy gradient, which removes the need to estimate baselines by generating a random number of pseudo actions and estimating their action-value functions.
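For context, a sketch of the vanilla REINFORCE estimator that ARSM starts from (the augmentation, swap, and merge steps that then reduce its variance are not reproduced here): for z ~ Cat(softmax(phi)), the score-function identity gives grad_phi E[f(z)] = E[f(z) (onehot(z) - p)]. The reward f and logits phi below are toy choices.

```python
import numpy as np

rng = np.random.default_rng(3)
phi = np.array([0.5, -0.2, 0.1])               # logits of a 3-way categorical
p = np.exp(phi) / np.exp(phi).sum()
f = np.array([1.0, 3.0, -2.0])                 # toy reward per category

# Exact gradient by enumeration: d/dphi_k E[f] = p_k (f_k - E[f])
true_grad = p * (f - p @ f)

# Monte Carlo REINFORCE: average f(z) * grad_phi log p(z) over samples
z = rng.choice(3, size=200000, p=p)
onehot = np.eye(3)[z]
est = (f[z][:, None] * (onehot - p)).mean(axis=0)
print(true_grad, est)
```

Even in this univariate toy, a large number of samples is needed for a tight estimate, which is exactly the variance problem ARSM targets.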

Thu 13 June 16:40 - 17:00 PDT

On Variational Bounds of Mutual Information

Ben Poole · Sherjil Ozair · AƤron van den Oord · Alexander Alemi · George Tucker

Estimating, minimizing, and/or maximizing Mutual Information (MI) is core to many objectives in machine learning, but tractably bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks (Alemi et al., 2016, Belghazi et al., 2018, van den Oord et al., 2018). However, the relationships and tradeoffs between these bounds remain unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On a suite of high-dimensional, controlled problems, we empirically characterize the bias and variance of both the bounds and their gradients and demonstrate the effectiveness of these new bounds for estimation and representation learning.
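A minimal InfoNCE-style bound (one of the bounds the framework covers) on correlated Gaussians, where the true MI is known in closed form; the critic f(x, y) = x*y is a deliberately simple stand-in for a learned neural critic, so the bound is loose but still valid.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, K, B = 0.8, 128, 200                      # correlation, batch size, repeats
true_mi = -0.5 * np.log(1 - rho ** 2)          # exact MI of the Gaussian pair

vals = []
for _ in range(B):
    x = rng.normal(size=K)
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=K)
    scores = np.outer(x, y)                    # critic f(x_i, y_j) = x_i * y_j
    m = scores.max(axis=1, keepdims=True)      # stable row-wise log-softmax
    log_sm = scores - m - np.log(np.exp(scores - m).sum(axis=1, keepdims=True))
    vals.append(np.diag(log_sm).mean() + np.log(K))  # positives vs. K-1 negatives
estimate = float(np.mean(vals))
print(estimate, true_mi)
```

Note the structural ceiling: each per-batch value is at most log K, which is one reason these bounds degrade when the true MI is large.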

Thu 13 June 17:00 - 17:05 PDT

Partially Exchangeable Networks and Architectures for Learning Summary Statistics in Approximate Bayesian Computation

Samuel Wiqvist · Pierre-Alexandre Mattei · Umberto Picchini · Jes Frellsen

We present a novel family of deep neural architectures, named partially exchangeable networks (PENs), that leverage probabilistic symmetries. By design, PENs are invariant to block-switch transformations, which characterize the partial exchangeability properties of conditionally Markovian processes. Moreover, we show that any block-switch invariant function has a PEN-like representation. The DeepSets architecture is a special case of PEN and we can therefore also target fully exchangeable data. We employ PENs to learn summary statistics in approximate Bayesian computation (ABC). When comparing PENs to previous deep learning methods for learning summary statistics, our results are highly competitive for both time series and static models. Indeed, PENs provide more reliable posterior samples even when using less training data.
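A sketch of the fully exchangeable special case (the DeepSets architecture mentioned above), with illustrative random weights: an element-wise embedding phi, an order-invariant sum pooling, and an outer map rho producing the summary statistic. Permutation invariance holds by construction.

```python
import numpy as np

rng = np.random.default_rng(5)
W_phi = rng.normal(size=(1, 4))               # illustrative embedding weights
W_rho = rng.normal(size=(4, 2))               # illustrative output weights

def summary(x):                               # x: (n,) exchangeable sample
    h = np.tanh(x[:, None] @ W_phi)           # phi applied element-wise
    return np.tanh(h.sum(axis=0) @ W_rho)     # sum-pool, then rho

x = rng.normal(size=6)
assert np.allclose(summary(x), summary(x[::-1]))  # invariant to reordering
print(summary(x))
```

PENs generalize this by pooling only over blocks that may be switched, matching the partial exchangeability of conditionally Markovian data.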

Thu 13 June 17:05 - 17:10 PDT

Hierarchical Importance Weighted Autoencoders

Chin-Wei Huang · Kris Sankaran · Eeshan Dhekane · Alexandre Lacoste · Aaron Courville

Importance weighted variational inference (Burda et al., 2016) uses multiple i.i.d. samples to obtain a tighter variational lower bound. We believe a joint proposal has the potential to reduce the number of redundant samples, and introduce a hierarchical structure to induce correlation. The hope is that the proposals coordinate to compensate for one another's errors, reducing the variance as a whole. Theoretically, we analyze the condition under which convergence of the estimator variance can be connected to convergence of the lower bound. Empirically, we confirm that maximization of the lower bound does implicitly minimize variance. Further analysis shows that this is a result of negative correlation induced by the proposed hierarchical meta sampling scheme, and that performance of inference also improves when the number of samples increases.
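A toy version of the importance weighted bound being tightened, log (1/K) sum_k w_k with i.i.d. weights (not the paper's correlated hierarchical proposal): for the conjugate model z ~ N(0,1), x|z ~ N(z,1) with proposal q equal to the prior, the exact log p(x) is available (x ~ N(0,2)), so we can watch the bound approach it as K grows.

```python
import numpy as np

rng = np.random.default_rng(6)
x = 1.5
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0   # exact log N(x; 0, 2)

def iwae_bound(K, reps=5000):
    z = rng.normal(size=(reps, K))                       # i.i.d. draws from q = prior
    log_w = -0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi)  # log p(x|z); prior/q cancel
    m = log_w.max(axis=1, keepdims=True)                 # stable log-mean-exp
    return float(np.mean(m.squeeze() + np.log(np.exp(log_w - m).mean(axis=1))))

b1, b16 = iwae_bound(1), iwae_bound(16)
print(b1, b16, log_px)
```

The K=1 case recovers the ordinary ELBO; the hierarchical scheme above aims to close the remaining gap with fewer, negatively correlated samples.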

Thu 13 June 17:10 - 17:15 PDT

Faster Attend-Infer-Repeat with Tractable Probabilistic Models

Karl Stelzner · Robert Peharz · Kristian Kersting

The recent attend-infer-repeat (AIR) framework marks a milestone in Bayesian scene understanding and in the promising avenue of structured probabilistic modeling. The AIR model expresses the composition of visual scenes from individual objects, and uses variational autoencoders to model the appearance of those objects. However, inference in the overall model is highly intractable, which hampers its learning speed and makes it prone to sub-optimal solutions. In this paper, we show that inference and learning in AIR can be considerably accelerated by replacing the intractable object representations with tractable probabilistic models. In particular, we opt for sum-product (SP) networks, an expressive deep probabilistic model with a rich set of tractable inference routines. As our empirical evidence shows, the resulting model, called SPAIR, achieves a higher object detection accuracy than the original AIR system, while reducing the learning time by an order of magnitude. Moreover, SPAIR allows one to treat object occlusions in a consistent manner and to include a background noise model, improving the robustness of Bayesian scene understanding.
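A minimal sum-product network over two binary variables (an illustrative structure with made-up parameters, not SPAIR's object model): one sum node mixing two product nodes of independent Bernoulli leaves. Both likelihoods and marginals come from a single bottom-up pass, which is the kind of tractable inference routine SPAIR exploits.

```python
import numpy as np

w = np.array([0.6, 0.4])           # sum-node weights (mixture over 2 components)
px1 = np.array([0.9, 0.2])         # leaf: P(X=1 | component)
py1 = np.array([0.7, 0.3])         # leaf: P(Y=1 | component)

def spn(x_val, y_val):             # pass None to marginalize a variable out
    lx = np.ones(2) if x_val is None else (px1 if x_val else 1 - px1)
    ly = np.ones(2) if y_val is None else (py1 if y_val else 1 - py1)
    return float(w @ (lx * ly))    # product nodes, then the sum node

print(spn(1, 0))                   # joint P(X=1, Y=0)
print(spn(1, None))                # marginal P(X=1), at the same cost
```

Marginalizing a variable amounts to setting its leaf to 1, so queries that are intractable in a VAE decoder are a constant-factor pass here.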

Thu 13 June 17:15 - 17:20 PDT

Understanding Priors in Bayesian Neural Networks at the Unit Level

Mariia Vladimirova · Jakob Verbeek · Pablo Mesejo · Julyan Arbel

We investigate deep Bayesian neural networks with Gaussian priors on the weights and a class of ReLU-like nonlinearities. Bayesian neural networks with Gaussian priors are well known to induce an L2 ("weight decay") regularization. Our results indicate a more intricate regularization effect at the level of the unit activations. Our main result establishes that the induced prior distribution on the units before and after activation becomes increasingly heavy-tailed with the depth of the layer. We show that first layer units are Gaussian, second layer units are sub-exponential, and units in deeper layers are characterized by sub-Weibull distributions. Our results provide new theoretical insight on deep Bayesian neural networks, which we corroborate with experimental simulation results on convolutional networks.
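A quick simulation of the tail-heaviness claim, using excess kurtosis as a crude heaviness proxy (the width, depths, and sample count are illustrative choices, not the paper's setup): propagate a fixed input through ReLU layers with fresh Gaussian-prior weight draws per sample, and compare a unit's pre-activation distribution across depths.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples, width = 30000, 10
x = rng.normal(size=width)                    # one fixed network input

def unit_samples(depth):
    # Resample Gaussian weights for every draw; return samples of a single
    # pre-activation unit at the given depth (depth 1 = first linear layer).
    h = np.broadcast_to(x, (n_samples, width))
    for _ in range(depth - 1):
        W = rng.normal(scale=width ** -0.5, size=(n_samples, width, width))
        h = np.maximum(0.0, np.einsum('nij,nj->ni', W, h))
    v = rng.normal(scale=width ** -0.5, size=(n_samples, width))
    return np.einsum('ni,ni->n', v, h)

def excess_kurtosis(u):
    u = (u - u.mean()) / u.std()
    return float((u ** 4).mean() - 3.0)

k1 = excess_kurtosis(unit_samples(1))         # first layer: exactly Gaussian
k3 = excess_kurtosis(unit_samples(3))         # deeper: expect heavier tails
print(k1, k3)
```

First-layer units are exactly Gaussian here (kurtosis near zero), while the deeper unit is a Gaussian scale mixture over the random hidden norm, consistent with the increasingly heavy-tailed prior the theorem describes.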