We thank all the reviewers for their time.

"R2: One can always center the kernel ...": To center the kernel, one would need to evaluate the expectation of the kernel under the given distribution, which often requires MCMC approximation when the distribution is high-dimensional and complex. The role of Stein's identity is to let us construct degenerate kernels without sampling. This computational advantage distinguishes our method from traditional goodness-of-fit (GOF) tests, as well as from MMD-based kernel two-sample tests, which also require MCMC approximation when adapted to GOF testing.

"R2 & R6: What is the statistical power of our test, and when is it better than the other methods ...": Our empirical results on 1D cases show that our method is comparable to or better than all the baselines; we leave a theoretical study of the statistical power (including its dependence on dimension, as pointed out by R6) as future work.

"R6: how bandwidth influences the result": Empirically, we find that setting the bandwidth to the median pairwise distance of the data always works well and is computationally efficient. We also have some theoretical intuition that a good kernel should yield a small value of ||s_p - s_q||_{H^d}, as implied by Eq. 18, but more concrete results need to be established in the future.

"R6 & R8: Non-i.i.d. setting": Yes, this can be addressed by a straightforward adaptation of related results on U- and V-statistics, and we will study it in our extended version.

R8: The Fisher divergence is not as computationally tractable/convenient as KSD (see the discussion in the last two paragraphs of Section 5.1).

"R7 & R8: plot AUC curve / type I & type II errors": We will include these figures in the final version. We did check the type I and type II errors, and found that the type I error matches the 0.05 level well and remains roughly constant, so the overall error rate is essentially the type II error.

== R7: We greatly appreciate R7's unusually thorough review and constructive suggestions. We will carefully address all the points, but can only respond to the major ones here due to the character limit.

"Related work": Oates et al. is highly relevant and we will certainly cite and discuss this work. They made the connection between the RKHS and Stein's identity earlier than we did (for an interestingly different problem), and we will clearly acknowledge their contribution. Meanwhile, our contribution remains solid: we provide (probably the first) widely usable GOF test for distributions with intractable normalization constants. Our derivation of the RKHS-Stein combination is also different (via Definition 3.1), which helps clarify its connection to the Fisher divergence. We will discuss Gorham & Mackey (2015) after Eq. 11 and correct the statement in L097. References to Stein's method and Stein's identity will also be corrected.

"Proposition 3.2 & L295": Yes, this condition is too restrictive and there was a gap in the proof. But it is easy to fix: as directly implied by the definition of KSD in Eq. 6, one sufficient condition is that k(x, x') is integrally strictly positive definite (\int k(x, y) g(x) g(y) dx dy > 0 for any g \in L_2 \setminus \{0\}, which the Gaussian RBF kernel satisfies) and that g := p (s_q - s_p) \in L_2^d (under the Lebesgue measure). This also addresses the counterexample you provide, because the offending function is multiplied by p(x), which decays to zero (we missed this factor in our proof).
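For clarity, a sketch of how we intend to restate the corrected condition (the final wording in the revision may differ slightly):

\[
\int\!\!\int k(x, y)\, g(x)\, g(y)\, dx\, dy > 0 \;\; \text{for all } g \in L_2 \setminus \{0\},
\qquad
g := p\,(s_q - s_p) \in L_2^d ,
\]

under which the conclusion of Proposition 3.2 (KSD(q, p) = 0 if and only if p = q) holds.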
One sufficient condition for p (s_q - s_p) \in L_2^d is that p is bounded and the Fisher divergence between p and q is finite. This modification does not significantly affect Theorems 4.1 & 4.2, but we do need to keep the condition (s_q - s_p) \in H^d in Theorem 5.2(2), since otherwise Inequality 18 would be vacuous.

"Multivariate GOF by Chiu & Liu": We will experiment with Chiu & Liu or similar methods, but we also want to point out that, like other traditional multivariate tests, Chiu & Liu requires additional approximation for complex models (mainly for calculating the marginal CDFs in their Section 2) and is not directly usable for practical machine learning models with intractable normalization constants.

"LR-AIS": We found that the AIS log-likelihood approximation performs well empirically because the likelihood ratios under the null and alternative cases are reasonably well separated (Fig. 2h). For Fig. 2a, there is no existing AIS-based GOF test that we can use.

"MMD for GOF test": Figs. 1 & 2 show the results when the kernel expectation in MMD is approximated with a Monte Carlo sample size of 1000; in both cases we believe the results are already very close to those with the exact expectation. This can be verified in Fig. 2b, where the MMD result stops changing once the MC sample size reaches 100. Yes, KSD can be viewed as an MMD with a q-specific RKHS whose kernel is u_q(x, x'); we intend to discuss this connection in the journal version.
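To make concrete how KSD avoids the kernel-expectation/MCMC step (R2) and how the median-distance bandwidth is used (R6), below is a minimal NumPy sketch of the Stein kernel u_q(x, x') for an RBF base kernel and the corresponding U-statistic estimate. The names (ksd_ustat, median_bandwidth, score_q) are illustrative rather than our actual implementation; only the score of the unnormalized q is needed, and the calibration of the test threshold is omitted.

```python
import numpy as np

def median_bandwidth(X):
    # Median pairwise distance heuristic for the RBF bandwidth (the setting we use).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.sqrt(np.median(d2[d2 > 0]))

def ksd_ustat(X, score_q, h=None):
    """U-statistic estimate of KSD(q, p) from samples X ~ p, with X of shape (n, d).

    score_q: callable returning s_q(x) = grad_x log q(x) row-wise; it only needs the
    unnormalized density of q, so no normalization constant or MCMC is involved.
    """
    n, dim = X.shape
    if h is None:
        h = median_bandwidth(X)
    S = score_q(X)                                   # (n, d) score matrix
    diff = X[:, None, :] - X[None, :, :]             # pairwise differences x_i - x_j
    r2 = np.sum(diff ** 2, axis=-1)                  # squared distances
    K = np.exp(-r2 / (2 * h ** 2))                   # RBF Gram matrix k(x_i, x_j)

    # u_q(x, x') = s_q(x)^T s_q(x') k + s_q(x)^T grad_{x'} k
    #              + s_q(x')^T grad_x k + trace(grad_x grad_{x'} k)
    term1 = (S @ S.T) * K
    term2 = np.einsum('id,ijd->ij', S, diff) * K / h ** 2    # s_q(x_i)^T (x_i - x_j) k / h^2
    term3 = -np.einsum('jd,ijd->ij', S, diff) * K / h ** 2   # s_q(x_j)^T (x_j - x_i) k / h^2
    term4 = (dim / h ** 2 - r2 / h ** 4) * K
    U = term1 + term2 + term3 + term4

    np.fill_diagonal(U, 0.0)                         # drop i = j terms (U-statistic)
    return U.sum() / (n * (n - 1))

# Toy usage: samples from N(0, I) tested against q = N(0, I), where s_q(x) = -x;
# the statistic should be close to zero under the null.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
print(ksd_ustat(X, score_q=lambda x: -x))
```

In an actual test this statistic would still need a null threshold (e.g., at the 0.05 level), which we omit here; the point of the sketch is only that every quantity above is computed from the data and the score of the unnormalized model, with no sampling from q.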