Paper ID: 1356 Title: Inference Networks for Sequential Monte Carlo in Graphical Models

===== Review #1 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present a method for learning proposal distributions for sequential Monte Carlo, with proposals parameterized by neural networks. Their approach is appealing in that all of the inference networks can be trained offline by forward-sampling the joint distribution.
Clarity - Justification: Very well written.
Significance - Justification: Learning how to sample is an important direction in Bayesian inference.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
While the results are promising, they seem limited to low-dimensional problems. To me, the most important question in this line of work is how informative prior simulations are about a given realization of the data in high dimensions. Learning the posterior for all possible datasets by simulating from the prior seems harder than learning the posterior for just the observed data.
Minor Comments:
- Section 2.1 should give a concrete choice of gamma as an example for the reader.
- Cite Salakhutdinov and Larochelle, "Efficient Learning of Deep Boltzmann Machines", for inference networks as well.
- I am not sure about the claim that the KL is intrinsically good; more detail is needed. What about the Importance-Weighted Autoencoder (Burda, Grosse, Salakhutdinov 2015)? Directly learning with an importance-sampling objective might make more sense in the context of sequential Monte Carlo.

===== Review #2 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present a method for learning efficient (data-dependent) proposal distributions for sequential Monte Carlo inference.
To perform Bayesian inference in a generative model, they describe a method for defining the structure of an inverse model (the probability of the latents given the observed data) that is expressive enough to capture posterior dependencies. They then learn a mapping from observed data to posterior approximations, with the goal that the approximations will serve as good proposals within an SMC sampler. They apply their method to three models of increasing complexity and showcase the efficiency of their amortized sampling scheme.
Clarity - Justification: Despite including a lot of detail, the paper is quite easy to follow. The text is clear and the figures help clarify concepts.
Significance - Justification: This paper does a great job of applying inference networks to sequential Monte Carlo, and shows how efficient the combination can be.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper was enjoyable to read! Questions:
- How important is the inverse-model factorization for each of the models used? For instance, how does a fully factorized approximate distribution compare when estimating the marginal likelihood of the hierarchical model (Figure 5)?

===== Review #3 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper suggests using neural network models to automatically design proposals for importance sampling algorithms.
Clarity - Justification: This is an interesting but difficult paper to read. The paper juggles too many subjects/approaches, glued together in an eclectic and rather confusing way.
Significance - Justification: Hard to tell. I am guessing the main point of the paper IS NOT in developing a well-defined, focused tool ... but in combining together four or five different methods.
Detailed comments.
(Explain the basis for your ratings while providing constructive feedback.):
It would be useful to open the paper with a more extensive and self-sufficient description of the overall approach (the big picture). A schematic of how the individual pieces/methods/approaches (reviewed and discussed in Sections 2 and 3) relate to the main result of the paper would probably help. I am still confused about whether the main point of the paper is in combining all of these different pieces, or in supplying one (supposedly missing) link. Of all the pieces, the authors pay the most attention to 'defining the inverse model'. Is it the most significant piece?

===== Review #4 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper applies ideas from amortized variational inference to sequential Monte Carlo by restricting the inference network to be sequential.
Clarity - Justification: The paper is well written and easy to read.
Significance - Justification: The idea behind the paper is elegant, but not significantly different from a classical inference network (although trained with the reverse KL), with the only added constraint that inference is sequential. Though the paper is ostensibly about learning inference networks for Sequential Monte Carlo, the training criterion is not aware of the SMC step (this would be difficult, since computing the actual distribution of the posterior under SMC would be hard); this is similar to the idea of using VAE inference networks as proposal distributions for importance sampling (though they were not trained with an IS criterion, the IWAE model aside).
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper suggests training the model with a reverse KL. This is a form of 'reverse autoencoder': generate code and data from the decoder, and recover the code with a supervised, probabilistic criterion.
This is presented as an advantage since it effectively allows infinite amounts of training data; furthermore, it amortizes inference for a particular model over any dataset (as opposed to the classical KL cost, which amortizes inference across datapoints for a particular model and a particular dataset). However, this is not necessarily a good idea, for two reasons. First, it requires trusting your model a lot: for any problem where the model is only a weak inductive bias and the actual data of interest does not resemble data sampled under the prior, the learned inference network is going to be quite weak. Second, it asks a lot of the inference network, which has to be good under the entire prior distribution, which is likely to be much wider than the actual data of interest (at the very least for a model with global variables that strongly affect the 'location' of the data). Take for instance a Bayesian VAE (with global latent variables as parameters of the decoder, and local variables as the source of randomness z). Trained under this paper's criterion, the inference network would amortize inference over the decoder parameters for *any* dataset, be it MNIST, OMNIGLOT, etc. It would therefore have been important, in my mind, to highlight examples where the data was not sampled from the forward model. The model in Section 4.2 is such an example, but it is overly simple, and the global variables have relatively limited influence on the data - most of the inference is on the local variables \theta_n. The models in Sections 4.1 and 4.3, as I understand, were tested with data from the forward model. Note also that the idea of training an 'encoder' on synthetic data from the prior to be used as a proposal distribution (though not in an SMC scheme) is also present in "Picture: A Probabilistic Programming Language for Scene Perception" (Kulkarni et al.), which is cited but not referenced in the text.
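To make the 'reverse autoencoder' reading above concrete: a minimal sketch of the scheme, as I understand it, is to sample (code, data) pairs from the generative model and fit the inference network by maximum likelihood on those pairs, then use it as an importance-sampling proposal. The toy conjugate model, the linear-Gaussian proposal family, and all names below are my own illustrative choices, not the paper's:

```python
import numpy as np

# Toy model (my choice): z ~ N(0, 1), y | z ~ N(z, 1),
# so the exact posterior is z | y ~ N(y/2, 1/2).
rng = np.random.default_rng(0)
n = 200_000
z = rng.standard_normal(n)       # "codes" sampled from the prior
y = z + rng.standard_normal(n)   # "data" sampled from the decoder/likelihood

# For a linear-Gaussian q(z|y) = N(a*y + b, s2), the maximum-likelihood fit
# on joint samples is just linear regression of z on y.
a = np.cov(z, y)[0, 1] / np.var(y)
b = z.mean() - a * y.mean()
s2 = np.mean((z - (a * y + b)) ** 2)   # a -> 0.5, s2 -> 0.5 as n grows

# Use q as an importance-sampling proposal for one observed y_obs and compare
# its effective sample size (ESS) against the untrained prior proposal.
y_obs, m = 2.0, 10_000

def log_joint(zz):
    # log p(z) + log p(y_obs | z), up to an additive constant
    return -0.5 * zz**2 - 0.5 * (y_obs - zz) ** 2

def ess(logw):
    w = np.exp(logw - logw.max())
    return w.sum() ** 2 / (w**2).sum()

z_prior = rng.standard_normal(m)
ess_prior = ess(log_joint(z_prior) + 0.5 * z_prior**2)

mu = a * y_obs + b
z_q = mu + np.sqrt(s2) * rng.standard_normal(m)
ess_q = ess(log_joint(z_q) + 0.5 * (z_q - mu) ** 2 / s2)

print(ess_prior, ess_q)  # the learned proposal should dominate the prior
```

In this conjugate toy the proposal family contains the exact posterior, so the learned q is near-optimal; the concern raised above is precisely that in richer models (or for data not sampled under the prior) this gap between the prior-trained q and the posterior for the observed data can be large.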
- A point of mild confusion: are the inference networks intended to be filtering networks only? As presented, it seems the posterior should condition on the entire data, so in general the inverse graph should always have every latent depend at least on the entire data (Proposition 1 does not address this, since the conditioning statement for the latents should be given the entire vector y). For instance, even knowing the parameters of an HMM exactly, the posterior on the first latent z_1 depends on all observations (but Figures 3b-c make the posteriors look like filtering networks).
=====
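The filtering-vs-smoothing point in the last review can be checked with a tiny forward-backward computation; the 2-state HMM and its parameters below are made up purely for illustration:

```python
import numpy as np

# 2-state HMM showing that, even with known parameters, the exact posterior
# on the first latent z_1 depends on ALL observations, so a purely
# filtering-style inverse network cannot represent it.
A = np.array([[0.9, 0.1], [0.1, 0.9]])   # transition matrix (made-up)
B = np.array([[0.8, 0.2], [0.2, 0.8]])   # B[z, y] = p(y | z) (made-up)
pi = np.array([0.5, 0.5])                # initial state distribution
ys = [0, 1, 1, 1]                        # observations

# Forward pass: alpha_t ∝ p(z_t, y_{1:t})
alphas = [pi * B[:, ys[0]]]
for obs in ys[1:]:
    alphas.append((alphas[-1] @ A) * B[:, obs])

# Backward pass: beta_t ∝ p(y_{t+1:T} | z_t)
betas = [np.ones(2)]
for obs in reversed(ys[1:]):
    betas.insert(0, A @ (B[:, obs] * betas[0]))

filtering = alphas[0] / alphas[0].sum()   # p(z_1 | y_1)
smoothing = alphas[0] * betas[0]
smoothing = smoothing / smoothing.sum()   # p(z_1 | y_{1:T})
print(filtering, smoothing)  # the later observations shift the belief about z_1
```

Here the filtering posterior puts most mass on state 0 after the first observation, while conditioning on the full sequence pulls z_1 toward state 1, which is exactly the dependence a filtering-only inverse graph would miss.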