Paper ID: 1000 Title: Variational Inference for Monte Carlo Objectives

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a variance reduction technique for multi-sample Monte Carlo objective functions, and in particular for the tighter variational lower bound recently proposed by Burda et al. The trick, explained in Section 2.5.3, is based on constructing sample-specific control variates for the parameters of the proposal/variational distribution using a leave-one-out procedure. That is, the baseline subtracted for the sample h_j is obtained using the remaining samples.

Clarity - Justification:
The paper is well written and the idea is clearly explained. The authors spend a lot of time reviewing previous work, but this turns out to be useful and makes the proposed trick easier to understand. The paper contains a significant number of typos, which the authors could easily correct with a couple of careful readings.

Significance - Justification:
The proposed method is nice. Given the increasing popularity of Monte Carlo objectives and deep latent variable models, I believe the idea is quite significant and useful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Methodologically, the paper has a clear contribution, which boils down to the leave-one-out construction of the control variates for the stochastic gradients. The really weak point of the paper is the experiments. They are very basic, showing only a couple of plots of the lower bounds, and only results on MNIST using a sigmoid belief net. The method you are proposing seems quite general; it would be far more convincing if you could show its applicability to a model other than an SBN. It is also not clear what happens with the variance reduction itself: it would be better to plot the variance of some parameters of the proposal across iterations, and also for different sample sizes.
Also, what is the reason that NVIL performs badly as the number of samples increases? In addition, the bounds in the figures do not look very noisy (as one would expect from a stochastic variational algorithm). Can you explain what exactly is plotted in these figures? Is it a smoothed version of the bounds?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a new variational inference algorithm for multi-sample importance sampling objectives that extends recent work by Burda et al. to discrete variables. Their approach is based on recent score-function-based stochastic gradients of variational objectives. They propose a pair of variance reduction techniques and demonstrate good results on MNIST and a structured output prediction task.

Clarity - Justification:
The paper was very nice to read.

Significance - Justification:
Accurate inference with discrete data is a central problem in latent variable models, but I think the work would be more significant with a little more care in the experiments and some restructuring of the text.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The key to the authors' approach is the following: since the objective involves multiple samples, the samples can be used in a leave-one-out manner to reduce the variance without introducing bias. This on-the-fly Monte Carlo "centering" is better than the function approximation in NVIL. This is clear from the results (Figure 1), but it does not come across in the text. More generally, it seems one could take the "single-sample objective" and use 50 samples to reduce its variance in the manner proposed in the paper. This makes it hard to determine how much of the value comes from the variance reduction and how much comes from the new objective.
Maybe the paper should be restructured as: 1) general variance reduction with multiple samples; 2) results on single-sample objectives; 3) multi-sample objectives (which naturally have multiple samples); 4) results on multi-sample objectives. Next, how do the results compare in terms of runtime rather than iteration count? Lastly, the claims about generality are a bit too strong. The proposed approach is general for models that factorize across their observations (i.e., no global random variables). The math seems to follow for the bound in the presence of global random variables, but getting unbiased gradients via data subsampling breaks down, as the sum prevents the log from splitting over the product. Changing the wording would help with this.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a new unbiased estimate of the lower bound on the marginal likelihood of latent-variable models, as estimated by multi-sample variational estimators.

Clarity - Justification:
Reading this paper was a pleasure. The authors clearly lay out the issues in plain language. It's probably only accessible to people who have thought about variational inference before, though, because the intro doesn't contain a concrete example. The related literature was covered well, and the ideas were gradually built up with fairly clear notation. The use of theta and psi for all parameters and for non-variational parameters was a bit nonstandard. I think the title could be clearer, though; the current title could really be describing any number of related things. I would love a sentence or two explaining (10) in plain English, like "the difference between what we would expect to get based on the other samples and what we got on this particular sample."

Significance - Justification:
The proposed method is suitable for training models with discrete latent variables, an extremely general and important setting.
This method is also simple and contains standard methods as a special case. It doesn't introduce extra hyperparameters (other than the number of samples to use). The field is also in need of a review paper on variational autoencoders, and this paper goes some way towards that. One thing that would have made this paper a bit better would be a deep dive into a concrete example showing how the new estimator behaves in detail (besides overall performance, which the authors do measure). I'm not sure how I would go about this, however.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Typos: Line 552 - "that" -> "than". Figure 2 (right) has a mistake in the legend - some of the lines should be dashed.

=====
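To make the leave-one-out "centering" that all three reviews discuss concrete, here is a minimal NumPy sketch. This is my own illustrative code, not the authors' implementation: the function name is invented, and the choice of the geometric mean of the held-out log-weights as the substitute value for the left-out sample is an assumption based on the reviews' description of the baseline construction.

```python
import numpy as np

def vimco_learning_signals(log_f):
    """Leave-one-out learning signals for a K-sample Monte Carlo objective.

    log_f: 1-D array of K log-weights log f(x, h_j) for one datapoint.
    Returns (L_hat, signals): L_hat estimates log (1/K) sum_j f(x, h_j);
    signals[j] = L_hat - baseline_j, where baseline_j is the same estimate
    with log f(x, h_j) replaced by the mean of the remaining K-1 log-weights
    (i.e. the geometric mean of the held-out weights), so the baseline for
    sample j depends only on the other samples.
    """
    def log_mean_exp(a):
        # Numerically stable log of the mean of exp(a).
        m = a.max()
        return m + np.log(np.exp(a - m).mean())

    K = log_f.shape[0]
    L_hat = log_mean_exp(log_f)
    signals = np.empty(K)
    for j in range(K):
        held_out = np.delete(log_f, j)
        # Substitute the geometric mean of the other samples for sample j.
        substitute = held_out.mean()
        baseline = log_mean_exp(np.append(held_out, substitute))
        signals[j] = L_hat - baseline
    return L_hat, signals
```

With all weights equal, every signal vanishes, which is the point of the construction: a sample only receives a nonzero learning signal to the extent that it differs from what the other samples predict, giving Review #3's "difference between what we would expect to get based on the other samples and what we got on this particular sample."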