We thank the reviewers for their comments and corrections.

AR2, AR3, and AR4: Following the suggestion of AR2, we will illustrate the variance reduction effect of the proposed approach more directly by plotting the variance of the local (VIMCO, Eq. 10) and the global (NVIL, Eq. 8) learning signals as a function of the iteration count and the number of samples used. As these signals modulate the gradients of the proposal distribution, their variance is a convenient parameterization-independent proxy for the variance of the parameter gradients themselves, which is much more challenging to estimate in a black-box manner.

AR2: While we agree that it would be worthwhile to apply VIMCO to other types of models, we leave that as future work and concentrate on SBNs in the paper because they are the most commonly used discrete latent variable architectures in recent work on variational training of Helmholtz-machine-like models. Although we report results only on the MNIST dataset, we use it for two distinct if related tasks: generative modelling and structured output prediction. We also performed structured output prediction experiments using SBNs with a Gaussian observation model on the Toronto Face Dataset (following the setup of Raiko et al., 2015) but did not include the results in the paper, as the relative performance of the methods was the same as on MNIST. We will include these results in the supplementary material.

NVIL performance degrades as we increase the number of samples in the multi-sample objective because this increases the variance of the learning signal, eventually more than counteracting the benefit of using a multi-sample objective. This is a consequence of NVIL scoring all k samples in a set jointly, in contrast to VIMCO, which performs credit assignment among the k samples (both signals are sketched in code at the end of our response to AR3 below). As a result, the value of the NVIL learning signal for a given sample can depend heavily on the other k-1 samples, making its variance higher and more dependent on k. For example, if only one of the samples explains the observation well, all k samples get equal credit for it.

The plots in the paper do indeed show smoothed estimates of the multi-sample bound. These were obtained by computing the bound on 9600 (400*24) observations sampled randomly from the validation set after every 1000 parameter updates and smoothing the resulting estimates with a local (loess) fit to the nearest 10 estimates, using the “loess” method of geom_smooth in ggplot. We will document this procedure in detail in the supplementary material (a short sketch of the procedure is also included at the end of this response).

AR3: We have previously experimented with averaging over multiple samples for NVIL with a single-sample objective, but saw only negligible performance gains. As a result, we concluded that it is better to use multiple samples as part of a multi-sample objective and concentrated on this approach in the paper. We will report the results with multi-sample averaging for the single-sample objective in the final version of the paper. As for determining the cause of the reported gains, Table 1 shows that the gains are due to variance reduction: both VIMCO and NVIL use unbiased gradient estimators of the same multi-sample bound (when using the same number of samples), and the only difference between them is the variance reduction technique used. The per-iteration runtime is essentially the same for VIMCO, NVIL, and RWS when using the same number of samples, so there is virtually no difference between evaluating results by runtime or by number of iterations.
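To make this contrast concrete, here is a minimal numpy sketch of the two learning signals in simplified notation: log_w holds the k log importance weights log(P(x, h^i)/Q(h^i|x)) for a batch of observations, the input-dependent baselines that NVIL subtracts are omitted, and the function names are ours rather than the paper's.

```python
import numpy as np
from scipy.special import logsumexp

def global_signal(log_w):
    """NVIL-style (global) signal: the multi-sample bound estimate
    log (1/k) sum_i w_i is assigned to all k samples jointly, so each
    sample's signal depends on the whole set of weights."""
    k = log_w.shape[-1]
    bound = logsumexp(log_w, axis=-1) - np.log(k)
    return np.repeat(bound[..., None], k, axis=-1)

def local_signals(log_w):
    """VIMCO-style (local) signal: for sample j, subtract the bound
    recomputed with w_j replaced by the geometric mean of the other k-1
    weights, so each sample is credited only for its own contribution."""
    k = log_w.shape[-1]
    bound = logsumexp(log_w, axis=-1, keepdims=True) - np.log(k)
    signals = np.empty_like(log_w)
    for j in range(k):
        others = np.delete(log_w, j, axis=-1)
        log_w_hat = others.mean(axis=-1, keepdims=True)  # geometric mean in log space
        counterfactual = logsumexp(np.concatenate([others, log_w_hat], axis=-1),
                                   axis=-1) - np.log(k)
        signals[..., j] = bound[..., 0] - counterfactual
    return signals

# Dummy weights just to show the call pattern; logging np.var of each signal
# on a fixed validation batch during training is one way to produce the
# variance plots promised above.
log_w = np.random.default_rng(0).normal(size=(64, 5))
print(global_signal(log_w).var(), local_signals(log_w).var())
```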
Although the global random variable case is an interesting one, it is not something we had in mind in this paper. While in principle the proposed approach can be applied to global random variables (at the expense of k times the computation), the paper does not address this, and how to do it efficiently remains an interesting direction for investigation. We will change the wording to make this clear.

AR4: Thank you for pointing out the typos. To illustrate the behavior of the proposed estimator more directly, we will include the variance plots of the local and global learning signals as described in the first section above. We will also include an intuitive description of the learning rule given by Eq. 10 along the lines you suggested.
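Finally, a small illustration of the bound-evaluation and smoothing schedule described in our response to AR2 above. The sketch uses statsmodels' lowess as a Python stand-in for the geom_smooth loess fit we actually used in ggplot, and the numbers are dummy values rather than real measurements.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# iterations[i] is the update count at the i-th evaluation point (every 1000
# updates); estimates[i] is the multi-sample bound computed at that point on
# 9600 observations drawn at random from the validation set.  Dummy values:
rng = np.random.default_rng(0)
iterations = np.arange(0, 200_000, 1000)
estimates = (-100.0 + 10.0 * (1.0 - np.exp(-iterations / 3e4))
             + rng.normal(scale=0.3, size=iterations.size))

# Local regression whose window covers the nearest 10 estimates, mimicking
# geom_smooth(method="loess") with a span of 10 / (number of estimates).
frac = 10 / len(estimates)
smoothed = lowess(estimates, iterations, frac=frac, return_sorted=False)
```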