Paper ID: 168
Title: Hierarchical Variational Models

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes constructing the variational distribution from a hierarchical model, so that the lower bound remains tractable to compute while being more accurate. An additional lower bound is introduced to deal with the hierarchy, and a tractable inference method is proposed using ideas such as normalizing flows and variance reduction methods.

Clarity - Justification:
The paper reads well and is easy to understand, except for one part: the section on optimizing HVMs, especially the part on local learning with r, is a bit confusing. The problem and solutions in that section are unclear. I think it would be great to add a running example, so that the reader can understand why the third term in Eq. 8 is difficult and how Eq. 9 solves the issue.

Significance - Justification:
The problem is very relevant. For variational inference to be useful in practice, q should be allowed to be complex without increasing the complexity of the inference. This paper addresses that problem, which is a significant one. The ideas presented in the paper are also good and significant, but the experiments are lacking.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors have used ideas from latent variable models to put a prior distribution on the variational parameters, which is a good idea and might be useful. One issue is the lack of clarity about how general the proposed inference method is and when it is supposed to work well (e.g., the high-variance problem). Another issue is that the results are limited and not conclusive. Please comment on the above two issues.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents hierarchical variational models to address the "independence" drawback of mean-field approximations. The new model can capture posterior dependencies between latent variables.

Clarity - Justification:
In general this paper is very well written.

Significance - Justification:
Variational inference is an important area in machine learning, and the dominant mean-field family models each latent variable independently. The ability to capture posterior dependencies between latent variables is quite useful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper is generally well motivated and well written. My main concern is that the experiments on real data are too simplistic (text perplexity). I would suggest some more challenging tasks and a variety of different tasks.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper uses the auxiliary variable technique in the inference model to create richer posteriors for black-box/reparametrized variational inference in complex generative models.

Clarity - Justification:
I have no major comment regarding clarity. The writing is excellent and the paper is easy to follow. Minor qualms:
* Space allowing, more investigation of the numerical experiments would have been welcome. We find out that multi-layer stochastic models improve a lot on science but not on nytimes. An attempt at explaining that behavior would have been interesting.
* I would have used different notation for the source of randomness behind the reparametrization of z and \lambda (they are both denoted \epsilon); a short sketch of what I mean is given below.
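
To make the notational point concrete, here is how I would write the two sampling steps with distinct noise variables (this is my own sketch of the construction as I understand it; the exact parametrization in the paper may differ):

    \epsilon_\lambda \sim s(\epsilon_\lambda), \quad \lambda = f(\epsilon_\lambda; \theta)    [normalizing flow over the variational parameters]
    \epsilon_z \sim s(\epsilon_z), \quad z = g(\epsilon_z; \lambda)    [reparametrized draw of the latent variables, when one exists; otherwise z \sim q(z | \lambda) directly]

so that the marginal variational distribution is q(z; \theta) = \int q(\lambda; \theta) \prod_i q(z_i | \lambda_i) \, d\lambda. Keeping the two noise sources separate makes it clearer which expectation each reparametrization applies to.
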
Significance - Justification:
The overall approach is convincing and its presentation strong. At a high level, the inference model (where the improvements are found) can be described as the marriage of two ideas: normalizing flows [4] and inference models with auxiliary variables [1-3]. The inference model uses a normalizing flow which outputs stochastic auxiliary variables. Those serve as 'parameters' to the actual latent variables in the model. The critical bound (5) is essentially identical to equation (3) in [1] (the earliest form of that bound I could find is the "Agakov-Barber bound" in [2]; it is derived for an undirected model, but the translation to a directed model is easy). As such, I feel that perhaps more credit could have been given to previous work. [1] uses the bound for a particular inference model which resembles HMC, but this difference is not relevant for the high-level idea. It is true that [1] uses the bound for a continuous-variable model, but equation (3) in [1] does not rely on that. The use of score function estimators for discrete samplers is at this point well known. Furthermore, the authors argue that there is a difference in that the mixture model is applied to the parameters instead of the latent variables z, but, for instance, if one uses the method of [1] and T is set to at least 2, the 'parameters' (z_1) of the actual latent variables z (= z_2) will in fact be a mixture as well (with mixture components given by the first level z_0). Nevertheless, the idea of 'capping' normalizing flows with another stochastic layer (in particular, allowing discrete variables as the 'final' output, the actual latent variables) is novel and interesting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
* It seems to me that the comments on lines 464-470 are not quite accurate. While it is true that convex combinations cannot be sharper than the original components, the original components (per Figure 1a) were restricted to be independent. Correlations in the variational parameters \lambda, even with the simpler bound obtained with r(\lambda|z;\theta) = q(\lambda;\theta), can still be more informative than a true mean-field bound (consider, for instance, a generative model where all latents are by construction highly likely to take the same value, such as a ferromagnetic Ising model). I write out the bound I have in mind after these comments.
* I liked that this paper links the 'doubly variational' approach to the 'auxiliary variable in generative model' approach taken by [3,5] and essentially explains their equality.
* I found the proof of the equivalence between the reparametrized and score function estimates of the gradients interesting (it generalizes the use of integration by parts in the proof of Price's theorem for Gaussians [6]). However, the equality is already known (and is quite a bit easier to prove in the general case using the law of the unconscious statistician, as sketched after these comments); it is not clear what value it adds here.
* Similarly, the derivation of the upper bound on the entropy of z does not seem to add much to the paper as it stands. The connection to EP is not expanded on, and in fact, when considering auxiliary variational inference, naively applying the VI inequality to log q(z) (i.e., using that upper bound) leads nowhere towards obtaining the hierarchical ELBO.
* The numerical experiments section would have gained from being expanded a bit. We only get final estimates of perplexity on standard datasets. Either more detailed analysis on those datasets (speed of convergence, qualitative analysis of the latent representation, etc.) or more complex datasets/models would have been interesting.
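
Regarding the first comment, the bound I have in mind, written in the notation above (this is my own reconstruction; the exact form of (5) in the paper may differ in its details), is the auxiliary-variable / Agakov-Barber bound

    E_{q(z;\theta)}[ \log p(x, z) - \log q(z; \theta) ]
        \geq E_{q(z,\lambda;\theta)}[ \log p(x, z) + \log r(\lambda | z; \phi) - \log q(z | \lambda) - \log q(\lambda; \theta) ],

which follows from the non-negativity of KL( q(\lambda | z) || r(\lambda | z) ). Taking the simpler choice r(\lambda | z; \phi) = q(\lambda; \theta) collapses the right-hand side to

    E_{q(z,\lambda;\theta)}[ \log p(x, z) - \log q(z | \lambda) ],

which still couples the components of z through the shared draw \lambda \sim q(\lambda; \theta), and is therefore not the same object as a mean-field bound with independent \lambda_i.

For the comment on the gradient equivalence, the law-of-the-unconscious-statistician argument I have in mind is simply (again a sketch in my own notation): if z = g(\epsilon; \lambda) with \epsilon \sim s(\epsilon), then for any test function f,

    E_{q(z;\lambda)}[ f(z) ] = E_{s(\epsilon)}[ f(g(\epsilon; \lambda)) ],

and differentiating both sides with respect to \lambda gives

    E_{q(z;\lambda)}[ f(z) \nabla_\lambda \log q(z; \lambda) ] = E_{s(\epsilon)}[ \nabla_\lambda f(g(\epsilon; \lambda)) ],

i.e., the score-function and reparametrized forms of the gradient are two expressions for the derivative of the same quantity.
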
Minor:
* In equation (6), it sounds like \mathcal{L} should be replaced by \mathcal{L}_{MF}. For the same reason, that section should probably be called "Stochastic Gradient of the conditional (or MF) ELBO".
* I believe the sign in front of the KL (line 76) should be a minus.

[1] Markov Chain Monte Carlo and Variational Inference: Bridging the Gap, Salimans et al.
[2] An auxiliary variable method, Agakov and Barber.
[3] Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression, Salimans and Knowles.
[4] Variational Inference with Normalizing Flows, Rezende and Mohamed.
[5] Auxiliary Deep Generative Models, Maaløe et al.
[6] A useful theorem for nonlinear devices having Gaussian inputs, Price.

=====