We’d like to thank Reviewers 2 and 3 for their positive comments. To respond to some specific points:

[Reviewer 1]

> I am still confused if the main point of the paper…

We are sorry that Reviewer 1 found the paper unclear. The third paragraph of the introduction is intended to outline the overall goal, which we expand on in more detail at the start of section 3. The intent was for section 2 to convey background material and related work, with the novel procedure for learning proposal distributions described in section 3. We can revise to clarify the contribution.

[Reviewer 2]

> Specifically to me the most important questions in this line of work is how good are prior simulations for a realization of data in high dimensions. It seems like learning the posterior for all data by simulating from the prior is harder than learning the posterior for just the observed data.

It is true that this approach attempts to learn the *entire* posterior, and so it will be most efficient when samples from the generative model yield plausible synthetic data. Otherwise, some effort will be wasted in learning to approximate posteriors conditioned on values which do not correspond to *any* dataset we may one day encounter. However, to some extent that is a complementary problem of model selection, rather than inference. One interesting approach, given a decent supply of representative training data and a very broad potential model family, could be to use the training data to fit model hyperparameters in an “empirical Bayes” sense. This could be combined with our approach to amortizing inference over the remaining latent variables.

> I am not sure about the claim KL is intrinsically good; more detail is needed. What about …

This is an interesting suggestion! Motivated by adaptive importance sampling, this direction of the KL divergence (or the chi-squared distance) is the natural / standard choice of objective function. However, it is certainly worth thinking about whether other options may perform better.

[Reviewer 3]

> This paper was enjoyable to read!

Thanks!

> How important is the inverse model factorization for each of the model used?

One thing we’d like to point out is that this inverse factorization can actually lead to a simpler learning task than if we were to assume (say) the posterior fully factorizes when conditioned on the data. In the hierarchical Poisson model, for example, the latent global parameters {α, β} depend on the data only through the local latent variables θ, which have lower dimension than the full set of observed data. This is in addition to taking advantage of the factorization across the local latent variables to re-use a single learned “local” inverse for q(θ_n|y_n,t_n).
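
For concreteness, a rough sketch of this inverse factorization for the hierarchical Poisson model (in the paper’s notation, and writing N for the number of local latent variables, which is our assumed notation here) is:

q(θ, α, β | y, t) = q(α, β | θ) ∏_{n=1}^{N} q(θ_n | y_n, t_n),

so the “global” inverse q(α, β | θ) conditions on the data only through the local variables θ, while a single learned “local” inverse q(θ_n | y_n, t_n) is re-used across all n.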