We would like to thank all the reviewers for their time and insightful comments and questions. We will correct the typos and adopt many of the stylistic suggestions in the final version. Below we address the main comments in detail.

Reviewer 1

Although we stress that our estimates of the partition function come for free when doing simulated tempering, the reviewer correctly points out that this is also the case for importance sampling methods (including AIS). We will clarify this in the final version. As for comparing the complexity of different methods, the experiments present results where the respective computational complexities have been matched. Such a matching is easy, since in all cases the main cost comes from sampling. A theoretical comparison of the convergence rates of the estimates would require studying the rate of convergence in the central limit theorem for each method. While this would give valuable insights, it is beyond the scope of this paper.

The idea of treating \beta_k, in line 158, as a random variable (instead of a preselected sequence of values) is at the core of the new method, since the partition function of interest can be obtained from the marginal distribution q(\beta_k) of \beta_k, as shown in equation (11). If \beta_k were treated as a fixed sequence, the new method would not work. We will stress this point in the final version.

Reviewer 2

We share the reviewer's surprise that the RTS estimator easily beats AIS, and we feel this result enhances the impact of the paper. Although AIS is a dominant method in the machine learning community, there is no reason to believe that it is optimal in any sense, and indeed our results show that significant improvements over AIS are possible. As for comparing the bias and variance of RTS with those of AIS, the expressions presented in Sec 2.5 depend on the variance of the estimates \hat{c}_k, which in turn depends on the autocorrelations of the MCMC sampler to all orders (see Sup. Material, eq (27)).
The latter depend on the mixing properties of the sampler used in each model. A theoretical analysis of this point would certainly shed light on why RTS empirically beats AIS, but it is beyond the scope of this work.

We have performed the suggested experiment on the RBM with a uniform base distribution. In this case, we found that the bias of RTS decreases more quickly than that of AIS, and that the RMSE is significantly better up to 5e4 Gibbs sweeps; after that point, RTS and AIS converge at similar rates. As expected, all methods perform significantly worse with this base distribution, due to the large mismatch between the distributions and poor MCMC mixing. We will include this experiment in the supplement.

Reviewer 4

The reviewer seems to believe that the novelty of the RTS method lies in the Rao-Blackwellization (RB) of an otherwise known method to compute partition functions (referred to as TS). The paper may have caused this wrong impression by stressing the advantages of RB. But our core new idea is expressed in equation (11), which, as far as we know, had never been used before to estimate partition functions (despite its simplicity). Simulated tempering is a popular method for sampling from multimodal distributions, and our new contribution is its use to estimate partition functions via eq (11). That said, TS alone yields an ineffective estimator (as illustrated in Figures 1 and 4): Rao-Blackwellizing makes the estimator much more powerful and attractive. We will make these points clearer in the final version.

Given that both methods are novel but we only recommend RTS, not TS, we saw no benefit in including TS results in Figure 3. The seemingly good results of TS in Figure 4 up to K=100 are due to the high number of samples used (10,000); when this number decreases, the quality of TS quickly degrades, as shown in Figure 1.

We will correct the typo in eqs (15)-(16): the bias and variance should be those of \hat{Z}^{new}, not of \hat{Z}.
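To make the two ingredients discussed above concrete, the following is a minimal, self-contained sketch on a toy problem of our own choosing (a 1-D Gaussian geometric bridge, not the RBM experiments of the paper). It runs a simulated-tempering Gibbs chain and recovers the partition functions from the marginal over the temperature index, in the spirit of eq (11), using both raw visit counts (plain TS) and the accumulated conditional temperature posterior (the Rao-Blackwellized variant). The target variance, the number of temperatures, and the uniform choices of r_k and \hat{c}_k are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setup: geometric bridge between a standard normal base
# (beta = 0, partition function known analytically) and an unnormalised
# N(0, sigma2) target (beta = 1).
sigma2 = 4.0
K = 10
betas = np.linspace(0.0, 1.0, K)
prec = (1.0 - betas) + betas / sigma2     # precision of each tempered Gaussian
true_Z = np.sqrt(2.0 * np.pi / prec)      # analytic partition functions (for checking)

r = np.ones(K)       # prior weights over temperatures (assumed uniform here)
c_hat = np.ones(K)   # crude initial estimates of Z_k; the estimator remains consistent

n_steps = 100_000
k = 0
counts = np.zeros(K)   # plain TS statistic: indicator counts of the visited temperature
rb = np.zeros(K)       # Rao-Blackwellised statistic: accumulated q(k | x)

for _ in range(n_steps):
    # Gibbs step for x | k: each tempered distribution is Gaussian, so sample exactly
    x = rng.normal(0.0, 1.0 / np.sqrt(prec[k]))
    # Gibbs step for k | x: q(k | x) proportional to r_k * p_k*(x) / c_hat_k
    logw = np.log(r) - 0.5 * prec * x**2 - np.log(c_hat)
    w = np.exp(logw - logw.max())
    p = w / w.sum()
    rb += p                       # accumulate the full conditional instead of an indicator
    k = rng.choice(K, p=p)
    counts[k] += 1

# Marginal q(k) is proportional to r_k Z_k / c_hat_k, so, anchoring at the known base,
# Z_k = Z_0 * q(k) c_hat_k r_0 / (q(0) c_hat_0 r_k).
def z_from_marginal(q):
    return true_Z[0] * (q * c_hat * r[0]) / (q[0] * c_hat[0] * r)

Z_ts = z_from_marginal(counts / n_steps)
Z_rts = z_from_marginal(rb / n_steps)
print(Z_ts[-1], Z_rts[-1], true_Z[-1])
```

Over repeated runs, the Rao-Blackwellized estimate typically shows noticeably lower variance than the count-based one, mirroring the TS-versus-RTS gap illustrated in Figures 1 and 4.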
All tempered schemes in Sec 4.1 use the Hamiltonian/adaptive step sizes; we will make this explicit in the final version.

Reviewers 2 and 4

The "true" value in the RBM experiments was estimated as the average of estimates from AIS and RTS with 10^6 samples from 100 parallel chains. We note that the variance of these estimates was very low (≈ 0.006). This explanation was accidentally omitted and will be included in the final version.