Paper ID: 1122
Title: Bidirectional Helmholtz Machines

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an original approach to generative modeling based on fitting the geometric mean of the approximate inference and generative distributions. The authors deal with the fact that inference, learning, and sampling are all intractable in the model by using importance sampling-based approximations. This turns out to be a surprisingly effective approach. The experimental section is extensive and interesting, showing that Bidirectional Helmholtz Machines (BiHMs) are highly competitive with existing HM-like approaches, such as variational autoencoders. Interestingly, the number of IS samples needed to evaluate the trained models accurately is much lower for BiHMs than for other models, which appears to be a consequence of the peculiar "product of p and q" form of the BiHMs. My main reservation is that BiHMs are effectively undirected models and thus should be compared to models such as Deep Boltzmann Machines.

Clarity - Justification:
The paper is generally very well written, with the exception of the section on sampling and inpainting (l. 446), which could be clearer.

Significance - Justification:
This is an original and well-executed paper. The proposed models seem to perform quite well compared to HM-like models in terms of the log-likelihood numbers, but are a step back in the sense that sampling from them is intractable. As a result, I think these models should be seen as formidable competitors to DBMs, but perhaps not to VAEs/SBNs.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The starting point of the paper is to define a generative model in terms of two directed models: one (q) performs approximate inference, mapping observations to latent variables, while the other (p) performs approximate generation. Architecturally the two models look like the recognition and generative parts of the Helmholtz machine, but their function is not quite the same. In the proposed approach the inference model also includes a marginal for the observations (constrained to match that marginal under the approximate generative model p), while the generative model is not actually the "true" model we are interested in learning. The actual model distribution p* is defined by the normalized geometric mean of the joint distributions over the observed and latent variables induced by the two models and can be seen as corresponding to an undirected graphical model defined in terms of two _global_ potential functions. This observation, which is made but not discussed in the paper, immediately points to the difficulties with this approach: inference, learning, and sampling in such a model are all intractable. The paper argues that optimizing the product of the two distributions will encourage the true model p* to be close to the approximate inference and generative models used to define it. The explanation of the similarities and differences between this approach and the one based on the standard variational bound is well done. The paper proposes a clever and seemingly novel way of sidestepping the difficulties caused by the intractable partition function by showing that it is upper-bounded by 1, which yields a new lower bound on the log-likelihood. Unfortunately, as the bound is still intractable, importance sampling (IS) is used to estimate it, as well as its gradients needed for learning.
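For concreteness, a minimal sketch (hypothetical code, not the authors' implementation; the function name and signature are illustrative) of how such an importance-sampling estimate of a bound of the form 2 log E_{q(h|x)}[sqrt(p(x,h)/q(h|x))] might be computed, assuming per-sample log-densities are available:

    # Hypothetical IS estimate of 2 * log E_{q(h|x)}[ sqrt(p(x,h)/q(h|x)) ],
    # given K latent samples h_k drawn from q(h|x).
    import numpy as np
    from scipy.special import logsumexp

    def estimate_bound(log_p_joint, log_q_cond):
        """log_p_joint[k] = log p(x, h_k); log_q_cond[k] = log q(h_k | x)."""
        log_p_joint = np.asarray(log_p_joint)
        log_q_cond = np.asarray(log_q_cond)
        K = len(log_p_joint)
        # Monte Carlo average of sqrt(p/q), computed in log-space for stability:
        #   log (1/K) * sum_k exp(0.5 * (log p(x, h_k) - log q(h_k | x)))
        log_mean = logsumexp(0.5 * (log_p_joint - log_q_cond)) - np.log(K)
        return 2.0 * log_mean

Such an estimator is consistent as K grows but biased for finite K, which is the same caveat raised about the gradient estimates elsewhere in the reviews.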
The fact that the algorithm does not require backpropagation through more than one layer, as emphasized on l. 375, is not an unusual property, however -- it is shared by the vast majority of algorithms for learning directed latent variable models that update their parameters after inferring/sampling the latent variables (e.g. anything EM-like). Unlike such directed models, however, the proposed model cannot be sampled from using ancestral sampling, which is a serious drawback. The sampling algorithm for the model given in the paper, which is effectively MCMC, requires monitoring convergence to avoid producing biased samples. The claim in the abstract about orders of magnitude faster inference is very misleading, as inference in HM-like models requires a single pass through the recognition model. It would be more accurate to claim that likelihood estimation/model comparison is much faster. I think Table 3 involves a comparison against some cherry-picked results for the competing approaches. Are the numbers provided for RWS and NVIL really the best results obtained on MNIST using such algorithms? Given that BiHM effectively trains an undirected model, it makes sense to compare to autoregressive (p and/or q) models trained using those methods (e.g. DARN/AR-SBN).

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a deep generative model which is trained jointly with an inference network. The novel contribution is the use of an objective function which discourages the generative model from producing samples which are hard to explain. The training procedure parallels that of reweighted wake-sleep. The method achieves log-likelihoods comparable to other Helmholtz machine-like models, but allows for much faster evaluation of the model.

Clarity - Justification:
The paper is well written and easy to follow. Perhaps there could be a bit more discussion of why it's desirable for p to stay close to q, and what sort of problems with standard VAEs this is meant to correct.

Significance - Justification:
The main contribution is a novel variational lower bound on the log-likelihood which can be optimized with a Helmholtz machine-like architecture. Analogously to variational inference, this is shown to be equivalent to the true log-likelihood minus the Bhattacharyya distance, which has the interesting property that it encourages p to stay close to the inferred posterior (a worked form of this decomposition is sketched after the minor comments below). I think the contribution is clever and non-obvious. If I understand right, there are two main benefits of this approach: (1) the samples under p* look cleaner, since ambiguous samples are discouraged, and (2) the model is much easier to evaluate, since less probability mass comes from difficult-to-find explanations. These seem like modest benefits, but they're probably enough to justify publication.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See above for discussion of significance. The experiments are pretty thorough, comparing against a lot of other models on a lot of datasets. The log-likelihood results are competitive with, or slightly better than, other Helmholtz machine-based approaches. I'm glad the authors are making the code available.

Minor comments:
- typo in Table 3: "Moldel"
- why isn't the lower bound given for the IWAE model in Table 3?
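To make the Bhattacharyya-distance connection noted under Significance explicit, one way to write the decomposition, following the form of the bound quoted in Review #3 and using the convention D_B(p, q) = -\log \sum_h \sqrt{p(h) q(h)} (the exact factor of 2 depends on the convention adopted in the paper):

    2 \log \sum_h \sqrt{p(x,h)\, q(h \mid x)}
      = \log p(x) + 2 \log \sum_h \sqrt{p(h \mid x)\, q(h \mid x)}
      = \log p(x) - 2\, D_B\left( p(h \mid x),\, q(h \mid x) \right),

so maximizing the bound pushes up log p(x) while pulling the posterior p(h|x) and the inference distribution q(h|x) towards each other, which matches the reviewer's reading.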
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a way of training a top-down directed generative model, together with approximate bottom-up inference in it, by optimizing a lower bound on the marginal log-likelihood that is distinct from the variational lower bound. The optimization proceeds by gradient ascent with biased (but asymptotically consistent) estimates of the gradient of the objective. The proposed approach is related to a point of view where both the generative model and the inference model serve as approximate inference networks in an intractable model. The approach is contrasted with existing methods (most relevantly, Reweighted Wake-Sleep) and found to be empirically superior in terms of generative modeling performance (as measured by a lower bound on the log-likelihood) and in terms of the ability to train very deep generative models.

Clarity - Justification:
The exposition is a little hard to follow, as the introduction of the probability distribution q(x) is implicit, and the model p* is interesting but, in the reviewer's opinion, doesn't add much understanding. The experiments are reproducible, and the code for reproducing them is provided.

Significance - Justification:
Training deep generative models with binary units is known to be difficult - in most cases the trained models don't use multiple layers (2 is common, 5 is harder but possible, as shown in the Reweighted Wake-Sleep paper, and models with more layers seem to be increasingly difficult to train). The proposed method allows training much deeper models (a 12-layer model is reported in the paper). This is a very interesting contribution - it definitely advances our understanding of what is possible. The paper also introduces a new lower bound on the log-likelihood that is distinct from the variational lower bound. It would be interesting to see whether this lower bound has new applications beyond those discussed in the paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The reviewer suggests accepting the paper based on a substantial contribution towards training very deep generative models with stochastic binary units, reproducible experimental evidence, and novel ideas on a trainable lower bound on the log-likelihood. In the reviewer's opinion the exposition could be improved significantly by only discussing the following simple idea:

    \sqrt{p(x)} = \sqrt{E_{q(h|x)}\left[ p(x,h)/q(h|x) \right]} \ge E_{q(h|x)}\left[ \sqrt{p(x,h)/q(h|x)} \right],

which follows from Jensen's inequality applied to the concave function \sqrt{\cdot}. Taking logarithms on both sides gives the proposed lower bound

    \log p(x) \ge \log \left( \sum_h \sqrt{p(x,h)\, q(h|x)} \right)^2,

which is then optimized by gradient ascent. In the reviewer's opinion this is the main idea of the paper, and the model p* only confuses the reader without contributing substantially to understanding (while also hindering obvious generalizations: instead of the concavity of \sqrt{\cdot}, the concavity of x^{1/\alpha} for \alpha > 1 could have been used, leading to multiple other lower bounds that might be useful as well). The reviewer doesn't insist on this latter point, however, and recommends the paper for publication either in its current form or after making the proposed clarifications. [Addressed in author's rebuttal - thank you]

=====
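As a small sanity check of the Jensen-based bound quoted in Review #3, a toy numerical sketch (hypothetical code on a made-up discrete example, not the paper's model or implementation):

    # Toy check that 2 * log sum_h sqrt(p(x,h) * q(h|x)) <= log p(x) for a fixed x.
    # The joint p(x, h) and conditional q(h | x) below are arbitrary, for illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    H = 8                                    # number of discrete latent states

    p_xh = rng.random(H) / 10.0              # p(x, h) for a fixed x; sums over h to < 1
    q_h_given_x = rng.random(H)
    q_h_given_x /= q_h_given_x.sum()         # proper conditional distribution q(h | x)

    log_px = np.log(p_xh.sum())              # exact log p(x) obtained by summing out h
    bound = 2.0 * np.log(np.sqrt(p_xh * q_h_given_x).sum())

    print(f"log p(x)    = {log_px:.4f}")
    print(f"lower bound = {bound:.4f}")
    assert bound <= log_px                   # guaranteed by Jensen / Cauchy-Schwarz

The gap between the two quantities is twice the Bhattacharyya distance between p(h|x) and q(h|x), so it closes only when the two conditionals agree.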