Thank you for your comments and feedback. There are a few points we would like to clarify and address regarding our paper.

First, we would like to address the concerns about Theorems 3.3 and 3.4. The marginal probabilities are unaffected by not resampling fresh Gumbel variates every time; the samples become dependent, but each still has the correct marginal distribution. Since these theorems make claims only about the marginals, they are unchanged by not resampling. Further, note that Theorem 3.3 makes no claim about $Z$ itself, only about the relationship between the two estimators (random variables) $\hat{Z}$ and $\tilde{Z}$. By the definition of the estimator, $\exp(-H) = \hat{Z}^{-1}$; $\tilde{Z}$ is defined analogously, except that $H$ is replaced by $\hat{H}$, which is computed using approximate MIPS instead of exact MIPS. Intuitively, Theorem 3.3 bounds how much using approximate MIPS instead of exact MIPS affects the estimator, not the relationship between $\tilde{Z}$ and $Z$. However, because $E[\hat{Z}^{-1}] = Z^{-1}$ (cf. Theorem 3.1), we can also take the expectation of $\hat{Z}^{-1} \le \tilde{Z}^{-1} \le e^c \hat{Z}^{-1}$ (Theorem 3.3) to bound the bias of the estimator $\tilde{Z}^{-1}$; see the short derivation at the end of this response. We will make this clearer in the final version.

Second, we would like to clarify the use case of our method. As the reviewers pointed out, our method does not scale to an exponentially large number of states (e.g. $2^{100}$); rather, it is intended for discrete spaces with a moderately large number of states (e.g. millions or hundreds of millions). Such state spaces can indeed be handled by brute force; our goal is to do better than brute force in an amortized sense. This was mentioned in Section 2.1.1 and will be emphasized more in the final version if accepted. Relatedly, our method only offers improvements when we wish to sample from the same set of states for many values of the parameters $\theta$. Thus, although the preprocessing time is $\Omega(Nkt)$ to sample the Gumbel variables, this cost is amortized over an arbitrarily large number of queries, so we believe it is correct to state the per-query runtime as $O(f(N)\,t)$. A toy sketch illustrating this reuse of Gumbel variates appears at the end of this response.

The empirical evaluation of our method was difficult to design because of the generality of log-linear models. The contribution of our paper is chiefly theoretical, and the goal of the experiments was to demonstrate that our method is practical and can provide speedups in some cases. We attempted to refrain from reporting quantities that are highly problem dependent: the number of queries over which the preprocessing time is amortized is one such quantity, and the extent of the dependence between samples, while an important consideration, is likewise very problem specific. A plot of error for various values of $k$ would be helpful for evaluating the nature of the dependence between samples, and we will include one in the final version if accepted.

Finally, we did not report the runtime of the algorithm with exact MIPS because it is a factor of $t$ slower than the exact method while still producing dependent Gumbel samples; one would never run it in practice. We showed the results of exact MIPS only to demonstrate that our method does not incur a significant loss of accuracy.
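
For concreteness, the bias bound referenced above follows in one line from the theorems as stated (notation as in the paper). Taking expectations of the pointwise inequality from Theorem 3.3 and applying $E[\hat{Z}^{-1}] = Z^{-1}$ (Theorem 3.1) gives

$Z^{-1} = E[\hat{Z}^{-1}] \le E[\tilde{Z}^{-1}] \le e^c\, E[\hat{Z}^{-1}] = e^c\, Z^{-1}$,

i.e. the bias of $\tilde{Z}^{-1}$ as an estimator of $Z^{-1}$ is at most a multiplicative factor of $e^c$.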
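
To illustrate the reuse of Gumbel variates, here is a minimal toy sketch (not our MIPS-accelerated implementation; the names `phi`, `sample`, and the toy data and sizes are purely illustrative). Pre-sampling the Gumbel variates once and reusing them across parameter values $\theta$ keeps each query's marginals exact while making samples across queries dependent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem sizes (illustrative only): N states, d features, k samples per query.
N, d, k = 100_000, 5, 32
phi = rng.normal(size=(N, d))   # feature vector phi(x) for each state (toy data)

# Preprocessing: draw k independent Gumbel variates per state, once.
# This is the cost that is amortized over all subsequent queries.
G = rng.gumbel(size=(k, N))

def sample(theta):
    """Draw k Gumbel-max samples from p(x) proportional to exp(theta . phi(x)).

    With fresh Gumbel noise this is the standard Gumbel-max trick;
    reusing the fixed G leaves each query's marginal distribution
    unchanged, but makes samples for different theta dependent.
    """
    scores = phi @ theta                  # unnormalized log-probabilities, shape (N,)
    return np.argmax(scores + G, axis=1)  # k state indices, one per Gumbel row

# Two queries with different parameters reuse the same preprocessing.
samples_a = sample(rng.normal(size=d))
samples_b = sample(rng.normal(size=d))
```

This sketch still pays $O(Nk)$ per query for the exact argmax; the per-query speedup in our method comes from replacing that exact maximization with approximate MIPS, which is exactly the substitution that Theorem 3.3 controls.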