We thank the reviewers for their time and constructive feedback. All the reviewers accepted the core idea of our paper, finding it to be significant and clearly explained. Quoting R1:

> The overall approach is convincing and its presentation strong.

First, we emphasize the computational difficulty of the models in our experiments. We focus on models with discrete latent variables, which are significantly more difficult than models with only differentiable latent variables. Further, one of the classes of models we consider, the Poisson deep exponential family, has countable discrete support. Because of the difficulty of navigating this space during optimization, these models are rarely considered in the variational inference literature. We use typical data and benchmarks, as our focus is on posterior inference, and we demonstrate success on models with complicated discrete posteriors with finite support (sigmoid belief nets) and infinite support (Poisson DEFs).

Lastly, we thank R1 for the experimental suggestions. We will add convergence plots and exploratory analysis of the posterior approximations, and we will discuss the differences between the multi-layer models fit to the Science and NYT data. We now address individual comments.

> R1 comments on how [1] and mixtures apply in this general setup, and also their applicability for discrete latent variables.

We would like to mention one difference regarding the mapping spaces. [1] is presented in light of MCMC stochastic transition operators (see its Section 1.2). These operators randomly map from a space Z back to the same space Z. Our approach differs in that we map between spaces that may vary in dimension and type (e.g., continuous to discrete). Similarly, the proposed characterization of the mixture model requires transitions from a discrete space to a continuous space to the latent variable space. Regarding discrete samplers, an off-the-shelf version can work in principle but may lead to unreasonably high variance. Key to our approach for discrete variables is the hierarchy placed on the mean-field factorization. This allows the variance reduction available for mean-field approximations to carry over to hierarchical variational models (cf. Rao-Blackwellization in Black Box Variational Inference (Ranganath et al., 2014)). One could argue that an approach such as Local Expectation Gradients for Black Box Variational Inference (Titsias and Lazaro-Gredilla, 2015) could mitigate this problem, but that approach is not feasible without further approximation in the latent Poisson models we consider. We will add more discussion of [1] to the paper.

> R1 mentions additional references for the hierarchical ELBO.

Thank you. We will add and discuss them.

> R1 asks about the comments on lines 464-470.

If r(lambda|z) = q(lambda), the hierarchical ELBO reduces to an expectation of the mean-field ELBO with respect to q(lambda) (by line 460). This expectation is bounded above by the maximum value achieved by the mean-field ELBO; therefore q(lambda) will place all of its mass on parameters achieving this value, reducing to a point mass. If multiple global optima exist, then q(lambda) may be any average of the point masses at the global optima. As an example, consider a 2-dimensional Bernoulli with probability table p(x=1, y=1) = p(x=0, y=0) = 0.01; p(x=1, y=0) = 0.51; p(x=0, y=1) = 0.47. The optimal variational model when r does not condition on z has r(lambda|z) = q(lambda), with q(lambda) set to a point mass on the optimal mean-field solution, q(x=1) = 0.977 and q(y=1) = 0.023. Thus, while the correlation structure is informative, the HVM cannot capture it when r does not depend on z.
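For concreteness, writing $\mathcal{L}_{\mathrm{HVM}}$ for the hierarchical ELBO and $\mathcal{L}_{\mathrm{MF}}(\lambda)$ for the mean-field ELBO at mean-field parameters $\lambda$ (local notation for this response), the reduction above is

$$\mathcal{L}_{\mathrm{HVM}} = \mathbb{E}_{q(z,\lambda)}\big[\log p(x,z) + \log r(\lambda \mid z) - \log q(z \mid \lambda) - \log q(\lambda)\big],$$

and setting $r(\lambda \mid z) = q(\lambda)$ cancels the $\log r(\lambda \mid z)$ and $\log q(\lambda)$ terms, leaving

$$\mathcal{L}_{\mathrm{HVM}} = \mathbb{E}_{q(\lambda)}\Big[\mathbb{E}_{q(z \mid \lambda)}\big[\log p(x,z) - \log q(z \mid \lambda)\big]\Big] = \mathbb{E}_{q(\lambda)}\big[\mathcal{L}_{\mathrm{MF}}(\lambda)\big] \le \max_{\lambda}\, \mathcal{L}_{\mathrm{MF}}(\lambda).$$

The Bernoulli example can also be checked numerically. The short script below is an illustration added here (not taken from the paper): it grid-searches over factorized distributions q(x)q(y) for the table above; since the target is normalized, minimizing KL(q || p) is equivalent to maximizing the mean-field ELBO.

```python
import numpy as np

# Probability table from the example: rows index x, columns index y.
# p(x=0,y=0) = 0.01, p(x=0,y=1) = 0.47, p(x=1,y=0) = 0.51, p(x=1,y=1) = 0.01.
p = np.array([[0.01, 0.47],
              [0.51, 0.01]])

# Grid over a = q(x=1) and b = q(y=1).
grid = np.linspace(1e-3, 1 - 1e-3, 999)
a, b = np.meshgrid(grid, grid, indexing="ij")

# Factorized q(x, y) = q(x) q(y), laid out to match p's (x, y) indexing.
q = np.stack([(1 - a) * (1 - b), (1 - a) * b,
              a * (1 - b),       a * b]).reshape(2, 2, *a.shape)

# For a normalized target, maximizing the mean-field ELBO = minimizing KL(q || p).
kl = np.sum(q * np.log(q / p[..., None, None]), axis=(0, 1))
i, j = np.unravel_index(np.argmin(kl), kl.shape)
print("optimal mean-field q(x=1), q(y=1):", grid[i], grid[j])
# Prints values near 0.977 and 0.023: the mean-field optimum locks onto one
# mode and cannot represent the strong (anti-)correlation between x and y.
```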
> R1 asks about the value of deriving the equivalence of the reparameterized and score function estimates.

This is in the appendix because we feel the literature is light on theoretical discussion of the two estimators. The derivation sheds some light on where the efficiency of reparameterization comes from, and as far as we know there are no general references for this (both estimators are written out for reference at the end of this response).

> R1 asks about the value of deriving an upper bound on the entropy.

This is in the appendix. While we do not explore it in this work, we believe these upper bounds will be useful for sandwich-estimating the entropy of q(z), and thus sandwich-estimating the ELBO. They also open a line of thinking about alternative divergences, as they enable different properties we may desire from the auxiliary approximation. (This is a topic for follow-up research.)

> R2 asks when the approach may work well (the "high variance" problem) and about the results in the experiments.

We mitigate the high-variance problem for discrete variables by placing differentiable priors on top of mean-field approximations; this maintains the variance reduction advantages of the mean-field (see the note on Rao-Blackwellization above). As the construction of variational models is orthogonal to variance reduction techniques, work such as "Local Expectation Gradients" may also be applied. Regarding the experiments, please see above.
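For reference, the two estimators in R1's first question are, in a generic notation (this is the standard statement of the two gradient estimators, not a transcription of our appendix): for a variational distribution $q(z;\lambda)$ and integrand $f(z)$,

$$\nabla_\lambda \, \mathbb{E}_{q(z;\lambda)}[f(z)] = \mathbb{E}_{q(z;\lambda)}\big[f(z)\,\nabla_\lambda \log q(z;\lambda)\big] \quad \text{(score function)},$$

and, when $z$ can be written as $z = T(\epsilon;\lambda)$ with $\epsilon \sim s(\epsilon)$ free of $\lambda$,

$$\nabla_\lambda \, \mathbb{E}_{q(z;\lambda)}[f(z)] = \mathbb{E}_{s(\epsilon)}\big[\nabla_\lambda T(\epsilon;\lambda)^{\top}\, \nabla_z f(z)\big|_{z = T(\epsilon;\lambda)}\big] \quad \text{(reparameterization)}.$$

The appendix relates these two forms directly, which is what we believe sheds light on where the efficiency of reparameterization comes from.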