We thank all the reviewers for providing detailed comments and directions for future work. We address the main concerns below.

Reviewer_3: Due to space limitations, we omitted a conclusion section from our submission but included it in the supplementary file. We prioritize significance over power to prevent analysts from drawing false conclusions. This is the same reason that many statisticians advocate using “exact” or “conservative” tests at small sample sizes: to avoid false discoveries when asymptotic approximations may not be valid. Empirically, all of our tests achieve at least the target 1-alpha significance, but at a loss of power. We point out in the supplementary file that, in order to achieve a fixed level of power, it suffices to have an additional 3000 samples for our DP tests with Laplace noise compared with the non-private version. We found this to be surprisingly good: typically one expects DP to require the sample size to blow up by a multiplicative factor of 1/eps; we see better performance because the noise is dominated by the sampling error. We will include these discussions in the camera-ready version. The previous works mentioned in lines 134 and 143 do not focus on GOF or independence tests. However, with the relatively large datasets that they consider (n < 2000) and the smallest privacy parameter that they choose (eps = 0.1), Figures 1 and 2 show that the empirical significance of the classical tests on the noisy data is well below the target 95%, even for symmetric hypotheses. We do not empirically test the significance of MCGOF because we prove in Theorem 5.3 that it has significance at least 1-alpha. For our small sample sizes in independence testing, the empirical significance being nearly 1 means that we essentially never reject under the null hypothesis. This satisfies our requirement for significance, but comes at the price of very low power at such sample sizes.
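To make the significance comparison concrete, the following is a minimal Python sketch (not the code used in the paper) of the kind of simulation underlying the discussion above: it estimates the empirical significance (non-rejection rate under the null) of the classical chi-squared GOF test when each cell count is perturbed with Laplace(2/eps) noise. The uniform null, the sample size, and the use of numpy/scipy are our assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def empirical_significance(n=500, k=4, eps=0.1, alpha=0.05, trials=2000):
    # Estimate Pr[no rejection | null] for the classical chi-squared GOF
    # test applied to Laplace-privatized counts. A valid level-alpha test
    # should keep this at or above 1 - alpha.
    p0 = np.full(k, 1.0 / k)                  # uniform null (illustrative choice)
    crit = stats.chi2.ppf(1 - alpha, df=k - 1)  # classical critical value
    rejections = 0
    for _ in range(trials):
        counts = rng.multinomial(n, p0)
        noisy = counts + rng.laplace(scale=2.0 / eps, size=k)
        stat = np.sum((noisy - n * p0) ** 2 / (n * p0))
        rejections += stat > crit
    return 1.0 - rejections / trials
```

With a small eps, the added noise inflates the statistic, so the estimate falls well below 1 - alpha, matching the behavior of the classical tests on noisy data described above.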
As we show in the right plot of Figure 4, the non-private tests also experience the same phenomenon at small enough sample sizes. The critical value (or threshold) is the point at which a test rejects the null hypothesis: if the test statistic falls above it, the test rejects. We will change the y-axis label in Figure 3 to “Critical Value” in the final version.

Reviewer_4: Due to space, we omitted all proofs, which can be found in the supplementary file, including the proof that our MC-based methods for GOF have guaranteed significance at least 1-alpha and an analysis of the asymptotic distribution under the null for Gaussian noise. Gaussian noise is convenient because we can use existing R packages to numerically find the tail probabilities of quadratic forms of normals. A similar result with Laplace noise would follow easily from our analysis by setting W = (U, V), where the noise V is now Lap(2/eps) instead of Gaussian and U is the same as in our paper. Our private chi-squared statistic would again be the quadratic form W^T A W, but now W itself is not multivariate normal. In fact, since our submission, Wang et al. have given this same asymptotic distribution for the chi-squared statistic with Laplace noise. We are not aware of any numerical techniques to find the tail probabilities of this distribution, so one may have to resort to MC methods (as Wang et al. did) to calculate critical values. In such a case, our MCGOF_Lap test would be a better choice, since it guarantees significance at least 1-alpha and does not rely on asymptotics. Determining the threshold (effective) sample size for a target level of power is definitely a very important area of study. In the supplementary version, we give the asymptotic distribution of the chi-squared statistic under the alternative hypothesis we consider in Section 8. With this distribution, it may be possible to compute the effective sample size, and we see this as a fruitful area of future work.
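The Monte Carlo calibration of critical values mentioned above can be sketched as follows. This is an illustrative Python version only (the paper and Wang et al. work in R, and the exact MCGOF_Lap significance guarantee relies on a rank-based rule in the supplementary file, which this simple quantile version only approximates); the multinomial null and parameter choices are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_critical_value(n, p0, eps, alpha=0.05, trials=5000):
    # Simulate the Laplace-privatized chi-squared statistic under the null
    # and use its empirical (1 - alpha) quantile as the rejection threshold,
    # avoiding any appeal to an asymptotic distribution.
    p0 = np.asarray(p0, dtype=float)
    expected = n * p0
    null_stats = np.empty(trials)
    for i in range(trials):
        counts = rng.multinomial(n, p0)
        noisy = counts + rng.laplace(scale=2.0 / eps, size=len(p0))
        null_stats[i] = np.sum((noisy - expected) ** 2 / expected)
    return np.quantile(null_stats, 1 - alpha)
```

More noise (smaller eps) pushes the threshold up, which is exactly why plugging the classical chi-squared critical value into the noisy statistic loses validity.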
Our methods do apply to k x k tables (or k x l tables), with running time polynomial in k plus the time for the iterative Imhof method, run at fixed accuracy, in the R package CompQuadForm. Regarding testing real data: we do not know the ground truth, so it is unclear how to evaluate the results. If the non-private test does not reject the null, this does not imply that the variables are actually independent. Further, if we look only at results that reject the null, then we must be careful about false discovery/multiple hypothesis testing issues (the null hypothesis may hold in more than an alpha fraction of them). We will fix the citation references with multiple brackets and make the characters in the plots easier to read in the final version.

Reviewer_6: Applying the exponential mechanism is an interesting direction to explore for a DP hypothesis test. A difficulty with this approach is that if power is to be the score function, then an alternative hypothesis must be fixed ahead of time, and it is not clear which of the (infinitely many) alternatives should be used.
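For concreteness, a generic exponential-mechanism sketch in Python (this is the standard mechanism, not something from our paper) shows where the difficulty enters: the score function must be specified up front, so using power as the score forces a choice of a single alternative.

```python
import numpy as np

rng = np.random.default_rng(2)

def exponential_mechanism(outputs, score, sensitivity, eps):
    # Standard exponential mechanism: sample an output with probability
    # proportional to exp(eps * score / (2 * sensitivity)).
    scores = np.array([score(o) for o in outputs], dtype=float)
    logits = eps * scores / (2.0 * sensitivity)
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return outputs[rng.choice(len(outputs), p=probs)]
```

Here `score` would have to be the test's power, which is only well defined once a particular alternative hypothesis is fixed; that is the difficulty noted above.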