We would like to thank the reviewers for their overall positive and constructive comments.

ANSWER TO REVIEWER 2

2.1 "The equation on line 134, that p(x,z,a) = p(a | x, z)p(x, z), seems vacuous: it must be so by the rules of probability (product rule)."
The point is that p(x,z) = p(x|z)p(z) is chosen to be the VAE generative model, and the chosen decomposition of p(x,z,a) leaves this unaltered. If we had instead chosen a decomposition where x and/or z were conditioned on a, the result would have been a model with two stochastic layers, as discussed in the paper. (See also the sketch at the end of our answers to Reviewer 2.)

2.2 "In terms of the derivation, it's a pretty straightforward application of "Hierarchical variational models" (Ranganath, et al 2015) to deep generative models."
As we point out in the paper, the use of auxiliary variables for variational inference in machine learning goes back to at least Agakov and Barber (2004). We also published a preliminary version of our work simultaneously with Ranganath et al. (2015), which we could not cite in the paper due to blinding.

2.3 "The paper "Variational Gaussian processes" (Tran, et al., ICLR 2016) goes further than either paper, introducing a variational distribution that's flexible enough to match any posterior, and that produces state-of-the-art results on MNIST (unsupervised) when coupled with the DRAW model. If that work catches on, this work may not end up being widely cited."
Variational Gaussian Processes (VGP) presents interesting extensions of variational methods. The auxiliary variable is used somewhat differently in the VGP variational objective and optimization than in our approach, so combining the methods is definitely an interesting option. For example, in our approach the auxiliary variable could be replaced by a GP. Alternatively, the variational approximation could be improved with normalizing flows (Rezende and Mohamed, 2015) or inverse autoregressive flow (IAF) (Kingma et al., 2016). Currently VGP and IAF appear to be equally appealing frameworks for improving the variational approximation. Whether GPs will become an integral part of deep learning depends, in our opinion, on whether the proposed frameworks prove robust.

2.4 "Is it necessary to include `a` in the generative model? Could `a` just be used to specify the variational distribution, and get marginalized out before computing the KL divergence between q and p? I think that's what they do in "Hierarchical Variational Models" (Ranganath, et al.; NIPS approximate inference workshop, 2015). It would be cleaner not to have to change the generative model at all, just to get a more flexible variational distribution."
Good point. In our opinion it is a matter of taste whether one prefers 'a' in the generative model or marginalized out. We think the former gives a cleaner formulation of the variational objective with (at least in some cases) an equivalent final objective; see "Note on the equivalence of hierarchical variational models and auxiliary deep generative models" (Brümmer, 2016) and the sketch at the end of our answers to Reviewer 2.

2.5 "I wonder how the proposed model would compare to a standard VAE with a block-diagonal covariance matrix (a full covariance matrix for each example), rather than a fully factorized one."
Good point. Rezende et al. (2014) tested a diagonal-plus-rank-one approximation to the covariance of the variational distribution; it gave improvements below one nat. We believe that auxiliary variables and other approaches such as normalizing flows give more flexible approximations, as also illustrated in the first toy example of our paper.
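Sketch for points 2.1 and 2.4 (informal, in the notation of our paper; r(a|x,z) is our rendering of the auxiliary "reverse" distribution used in hierarchical variational models, not a symbol from our paper). In our formulation the auxiliary variable enters the generative model through p(a|x,z), and with q(a,z|x) = q(a|x) q(z|a,x) the bound is

\log p(x) \geq \mathbb{E}_{q(a,z|x)}\!\left[ \log \frac{p(a|x,z)\, p(x|z)\, p(z)}{q(a|x)\, q(z|a,x)} \right].

If instead a is kept purely variational and marginalized out of the generative model, one bounds the intractable entropy of q(z|x) with an auxiliary model r(a|x,z), giving

\log p(x) \geq \mathbb{E}_{q(a,z|x)}\!\left[ \log \frac{p(x|z)\, p(z)\, r(a|x,z)}{q(a|x)\, q(z|a,x)} \right].

Choosing r(a|x,z) = p(a|x,z) makes the two objectives coincide, which is the observation in Brümmer (2016). Note that both formulations leave p(x,z) = p(x|z)p(z) unchanged; only a decomposition conditioning x or z on a would turn the generative model into one with two stochastic layers.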
ANSWER TO REVIEWER 3

3.1 "** I believe that always using the labeled examples in the minibatch amounts to using different weights for the unlabeled and label terms in equation (15) (if we see minibatches as an unbiased subsample of that cost function)."
We agree. We found that always including the labeled data improves performance. A similar method is used in the Ladder network (Rasmus et al., 2015), where we have also found it important for reproducing their results. We will make this point clearer in the final version of the paper. (See the sketch at the end of this response.)

3.2 "** As hinted in the conclusion, I would be very curious to see the performance of a Bayes classifier derived from the model trained with a pure generative cost function (with parts of or all the data labeled)."
We will work on presenting results for the Bayes classifier in a follow-up paper.

3.3 "** Additional work in auxiliary methods for variational posterior could have been cited: Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression, Salimans and Knowles; Markov Chain Monte Carlo and Variational Inference: Bridging the Gap, Salimans et al."
Good point. We will cite these papers in the final version.

ANSWER TO REVIEWER 5

5.1 "Here is a suggested reference that is highly relevant: Variational Gaussian Process, Tran et al. ICLR 2016."
Yes, see our answer to Reviewer 2 (point 2.3) above.
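Sketch for point 3.1 (a minimal illustration, not our actual training code; the function and argument names and the data sizes below are placeholders):

import numpy as np

def mixed_minibatches(x_lab, y_lab, x_unlab, n_lab=100, n_unlab=100, seed=0):
    """Yield minibatches that always contain labeled examples.

    Each batch pairs n_lab labeled points with n_unlab unlabeled points.
    Relative to sampling uniformly from the pooled data, this up-weights
    the labeled term of the objective, which is exactly the implicit
    reweighting the reviewer points out.
    """
    rng = np.random.default_rng(seed)
    n_batches = len(x_unlab) // n_unlab
    for _ in range(n_batches):
        lab_idx = rng.choice(len(x_lab), size=n_lab, replace=False)
        unlab_idx = rng.choice(len(x_unlab), size=n_unlab, replace=False)
        yield (x_lab[lab_idx], y_lab[lab_idx]), x_unlab[unlab_idx]

# Example with dummy data: 1000 labeled and 49000 unlabeled points.
x_lab, y_lab = np.zeros((1000, 784)), np.zeros(1000, dtype=int)
x_unlab = np.zeros((49000, 784))
for (xb_l, yb_l), xb_u in mixed_minibatches(x_lab, y_lab, x_unlab):
    pass  # feed (xb_l, yb_l) to the labeled term and xb_u to the unlabeled term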