Paper ID: 699
Title: Auxiliary Deep Generative Models

Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The variational distribution for deep generative models can be made more flexible through the addition of auxiliary variables. Empirical results suggest that more flexible models converge faster and represent the posterior more accurately.

Clarity - Justification:
It's pretty clear. There are a number of places where the singular form of a verb is substituted for the plural. For example, on lines 31, 69, and 180, change "has" to "have"; on line 138, change "is" to "are"; and on line 79, change "offers" to "offer". Around line 409, I'd suggest dropping "is parameterized as", since q just is a fully factorized Gaussian (or, perhaps technically, the density function for a fully factorized Gaussian). Or you could say the encoder network parameterizes q, a fully factorized Gaussian.

Significance - Justification:
The results are encouraging. In terms of the derivation, it's a pretty straightforward application of "Hierarchical Variational Models" (Ranganath et al., 2015) to deep generative models. The paper "The Variational Gaussian Process" (Tran et al., ICLR 2016) goes further than either paper, introducing a variational distribution that's flexible enough to match any posterior and that produces state-of-the-art results on MNIST (unsupervised) when coupled with the DRAW model. If that work catches on, this work may not end up being widely cited.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The equation on line 134, that p(x, z, a) = p(a | x, z)p(x, z), seems vacuous: it must be so by the rules of probability (product rule). Is it necessary to include `a` in the generative model? Could `a` just be used to specify the variational distribution, and get marginalized out before computing the KL divergence between q and p? I think that's what they do in "Hierarchical Variational Models" (Ranganath et al., NIPS Approximate Inference Workshop, 2015). It would be cleaner not to have to change the generative model at all, just to get a more flexible variational distribution (both constructions are sketched below). I wonder how the proposed model would compare to a standard VAE with a block-diagonal covariance matrix (a full covariance matrix for each example), rather than a fully factorized one. The experimental results are compelling.
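For concreteness, here is a sketch of the two constructions being contrasted. The factorizations p(x, z) = p(z) p(x | z) and q(a, z | x) = q(a | x) q(z | a, x) are assumed here as in a standard VAE setup; only p(x, z, a) = p(a | x, z) p(x, z) is taken from the paper.

% Assumptions (not from the paper): p(x, z) = p(z) p(x | z) and
% q(a, z | x) = q(a | x) q(z | a, x).
%
% (1) Auxiliary variable included in the generative model: bound log p(x)
%     using a joint variational distribution over (a, z).
\log p(x) \;\ge\; \mathbb{E}_{q(a, z \mid x)}\!\left[
    \log \frac{p(a \mid x, z)\, p(x \mid z)\, p(z)}{q(a \mid x)\, q(z \mid a, x)} \right]
%
% (2) Generative model left unchanged (Ranganath et al., 2015): use only the
%     marginal q(z | x) = \int q(a | x)\, q(z | a, x)\, da as the variational
%     distribution; its intractable entropy term is in turn bounded with an
%     auxiliary distribution r(a | x, z).
\log p(x) \;\ge\; \mathbb{E}_{q(a, z \mid x)}\!\left[
    \log \frac{p(x \mid z)\, p(z)\, r(a \mid x, z)}{q(a \mid x)\, q(z \mid a, x)} \right]

Note that if `a` enters the generative model only as a leaf (i.e., x and z do not depend on a), then (1) is exactly (2) with the choice r(a | x, z) = p(a | x, z), and the marginal likelihood p(x) is unchanged, so the difference is largely one of presentation.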
=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper is about auxiliary variables in variational approximations of deep generative models. The paper proposes extending the variational distribution with auxiliary variables to make it more expressive. Experiments on standard benchmarks are performed with comparisons against reasonable alternatives.

Clarity - Justification:
This paper was very well written. (But a heads up for future reference: acknowledgments should not appear in papers under review.)

Significance - Justification:
More expressive variational approximations are a clear need for the community, and this paper addresses that need with an interesting approach that is well explained.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Here is a suggested reference that is highly relevant: "The Variational Gaussian Process", Tran et al., ICLR 2016. http://arxiv.org/pdf/1511.06499v2.pdf

=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper uses the auxiliary variable method in variational inference to provide a richer posterior in a semi-supervised variational autoencoder.

Clarity - Justification:
The paper is well written, and all details are provided to reimplement it if necessary.

Significance - Justification:
The auxiliary variable method is not novel, but it is properly credited, and its use in a semi-supervised VAE is novel. More importantly, this simple modification to the inference (with no meaningful modification to the generative model, except for the SDGM) yields impressive improvement over the baselines (the comparison between M1+M2, the closest competitor in terms of architecture, and ADGM/SDGM is striking). This shows the value not only of the authors' approach but also, more generally, of strong posteriors in a fixed generative model. It also exhibits good performance in fitting strongly multi-modal posterior distributions, a generally difficult problem.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
** I believe that always using the labeled examples in the minibatch amounts to using different weights for the unlabeled and labeled terms in equation (15), if we see minibatches as an unbiased subsample of that cost function (a brief illustration is appended after the reviews).
** As hinted in the conclusion, I would be very curious to see the performance of a Bayes classifier derived from the model trained with a pure generative cost function (with part or all of the data labeled).
** Additional work on auxiliary methods for variational posteriors could have been cited:
Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression, Salimans and Knowles.
Markov Chain Monte Carlo and Variational Inference: Bridging the Gap, Salimans et al.

Minor:
- Figures 4 and 5 look blurry.
- Since the focus of the paper is on semi-supervised learning, should the title reflect this? (As pointed out by the authors, there is prior work on generative models with auxiliary variables.)

=====
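On the minibatch-weighting point in Review #3: a brief illustration of why a fixed labeled count per minibatch reweights the objective. The symbols and example counts below are illustrative and not taken from the paper; the labeled and unlabeled terms of equation (15) are abbreviated as per-example losses L_l and L_u.

% Suppose the full objective sums over N_l labeled and N_u unlabeled examples,
%   J = \sum_{i=1}^{N_l} L_l(x_i, y_i) + \sum_{j=1}^{N_u} L_u(x_j),
% and every minibatch contains a fixed m_l labeled and m_u unlabeled examples
% drawn uniformly from their respective pools. Then
\mathbb{E}\!\left[\hat{J}_{\mathrm{batch}}\right]
  \;=\; \frac{m_l}{N_l} \sum_{i=1}^{N_l} L_l(x_i, y_i)
  \;+\; \frac{m_u}{N_u} \sum_{j=1}^{N_u} L_u(x_j),
% so the labeled term is effectively reweighted by
%   (m_l / N_l) / (m_u / N_u) = m_l N_u / (m_u N_l)
% relative to the unlabeled term. With illustrative numbers N_l = 100,
% N_u = 49,900, and m_l = m_u = 100, the labeled sum is upweighted by a
% factor of 499 compared to a uniform subsample of the full objective.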