Rev. 1: To use MMD one needs to know P, while for our test it suffices to know P up to a normalizing constant. Thus the MMD test cannot be applied for MCMC convergence diagnostics.

Rev. 2: Indeed, a chain is not stationary during the convergence phase, but E S(X, F, p)^2 > 0 for all X_t, and so the kernel of the normalized V-statistic is not degenerate. We will try to show that in such a case the normalized U-statistic cannot converge to a random variable, since the positive contributions of S(X, F, p)^2 accumulate over t.

Rev. 3: We now cite Oates et al.: Lem. 5.1 first appeared there. Regarding Lem. 3.1, we do not believe this was covered by Oates: the key result in their Thm. 1 is the construction of a new kernel k_0, whose associated RKHS has zero expectation under the target \pi (p in our case). Our Lem. 3.1 defines a random element \xi in the larger RKHS \Fcal_d and shows it is Bochner integrable, hence expectation can be exchanged with the inner product. The key point is that, unlike in Oates, the distribution q(Z) need not be the same distribution p(X) used to define the Stein operator. Specifically, unlike in Oates, Bochner integrability of the r.v. \xi(Z) in \Fcal_d requires a suitable condition on the expectation of \nabla \log p(Z) (revised assumption ii below).

GP expt, Lloyd & Ghahramani comparison: We ran a comparison experiment on the GP example. Following L&G, with the same kernel parameters, we compute the MMD statistic between the test data (X_test, y_test) and (X_test, y_rep), where y_rep are samples from the fitted GP. The null distribution can easily be sampled via repeated simulations from the GP (though at a large computational cost). For |y_rep| = |y_test|, our goodness-of-fit test produces a p-value of 0.007, where the MMD method achieves 0.05. We thus provide an order-of-magnitude improvement, with the further advantage that we do not need to sample from the GP.
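To make the discussion of the (non-)degenerate V-statistic concrete, here is a minimal sketch of a kernel-Stein V-statistic with a Gaussian base kernel. Only \nabla \log p is needed, so p may be unnormalized, as in the reply to Rev. 1. All names, the bandwidth choice, and the standard-normal example are ours for illustration; this is not the paper's exact experimental code.

```python
import numpy as np

def stein_kernel_gaussian(X, score, sigma=1.0):
    """Stein-modified Gaussian kernel matrix k_0(x_i, x_j).

    X     : (n, d) array of samples.
    score : callable returning grad log p row-wise, shape (n, d);
            only grad log p is needed, so p may be unnormalized.
    sigma : Gaussian kernel bandwidth (illustrative default, not tuned).
    """
    n, d = X.shape
    S = score(X)                                  # (n, d) score values
    diff = X[:, None, :] - X[None, :, :]          # (n, n, d) pairwise x - y
    sqdist = np.sum(diff ** 2, axis=2)            # (n, n) squared distances
    K = np.exp(-sqdist / (2 * sigma ** 2))        # base Gaussian kernel

    # k_0 = <s(x),s(y)> k + <s(x), grad_y k> + <s(y), grad_x k> + tr(grad_x grad_y k)
    ss = S @ S.T                                  # <s(x_i), s(x_j)>
    s_diff = np.einsum('id,ijd->ij', S, diff)     # <s(x_i), x_i - x_j>
    diff_s = np.einsum('jd,ijd->ij', S, diff)     # <s(x_j), x_i - x_j>
    trace_term = d / sigma ** 2 - sqdist / sigma ** 4
    return K * (ss + s_diff / sigma ** 2 - diff_s / sigma ** 2 + trace_term)

def ksd_v_statistic(X, score, sigma=1.0):
    """V-statistic: mean of k_0 over all pairs, diagonal included (hence >= 0)."""
    return stein_kernel_gaussian(X, score, sigma).mean()

rng = np.random.default_rng(0)
score_std_normal = lambda X: -X                   # grad log p for N(0, I)
X_null = rng.standard_normal((300, 2))            # samples from p
X_alt = X_null + 1.0                              # shifted: not from p
v_null = ksd_v_statistic(X_null, score_std_normal)   # small, near 0
v_alt = ksd_v_statistic(X_alt, score_std_normal)     # clearly larger
```

Since the V-statistic is the squared norm of an empirical mean in the Stein RKHS, it is nonnegative by construction, and a persistent positive value for samples not drawn from p is exactly the accumulation effect described in the reply to Rev. 2.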
Assumptions: ii: we have relaxed it to E \| \nabla \log p(Z) \|_1 < \infty, where Z is distributed according to the alternative distribution q. We will make clear that if (p, q) does not satisfy this assumption, e.g. Gaussian p and Cauchy q on unbounded domains, the test should not be used. i, iii: we relaxed Lipschitz continuity to local Lipschitz continuity. iv: relaxed to E |\partial^2 k(Z,Z) / \partial x_i \partial x_{i+d}| < \infty. The assumptions require that Z \sim q has finite moments of the magnitude of \nabla \log p (and similarly for iv).

Dim > 2 and comparison with another one-sample test: We compare with Baringhaus and Henze. Null: X is a d-dimensional standard normal. Sample sizes: 500/1000, a_n = 0.5. Alternative: X \sim N(0, I_d), then Y \sim Uniform[0,1] is added to the first coordinate only. The table shows power vs. sample size:

d     2    5    10      15         20         25
B&H   1/1  1/1  1/1     0.86/1     0.29/0.87  0.24/0.62
Us    1/1  1/1  0.86/1  0.39/0.77  0.05/0.25  0.05/0.05

Indeed, in high dimensions, if the expectation of the kernel exists in closed form, an MMD-type test like B&H is a better choice.

a_n and thinning (also L642): We recommend thinning the chain so that Cor(X_t, X_{t-1}) < 0.5, setting a_n = 0.1/k, and running the test with at least max(500k, 100d) data points, where k < 10 is the thinning factor and d is the data dimensionality.

L380 fixed. L409 fixed; it should be d/dx f_i(Z). L509: this is the rate of tau-mixing. If the MC satisfies geometric ergodicity then, by Theorem 3.7 of 'Basic Properties of Strong Mixing Conditions ...', the beta-mixing coefficient decays exponentially, which implies that the tau-mixing coefficient also decays exponentially. We will provide an extra section in the appendix discussing a class of MCs for which this test works. Lem. 5.1: our mistake. We now cite Oates et al., Prop. 2 & Remark 2 (we retain our elementary proof after adding explicit assumptions). L798: we could find the normalizing constant of the posterior distribution numerically, sample from it, and compare the samples obtained this way with samples generated with eps = 0.08 using a two-sample test.
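The thinning recommendation above can be sketched as a small helper: pick the smallest thinning factor k whose thinned chain has lag-1 autocorrelation below 0.5, then set a_n = 0.1/k and the minimum sample size max(500k, 100d). The function name and the AR(1) usage example are ours, purely for illustration.

```python
import numpy as np

def thinning_plan(chain, max_k=10):
    """Illustrative reading of the recommendation: smallest k < max_k with
    lag-1 autocorrelation of the thinned chain below 0.5; returns
    (k, a_n, minimum number of data points)."""
    chain = np.atleast_2d(chain.T).T              # promote 1-D chains to (n, 1)
    n, d = chain.shape
    for k in range(1, max_k):
        thinned = chain[::k]
        x = thinned - thinned.mean(axis=0)
        num = np.sum(x[1:] * x[:-1], axis=0)      # lag-1 cross terms per coordinate
        den = np.sum(x * x, axis=0)
        if np.max(np.abs(num / den)) < 0.5:       # worst coordinate decides
            return k, 0.1 / k, max(500 * k, 100 * d)
    raise ValueError("chain too correlated; increase max_k or run longer")

# usage: an AR(1) chain with autocorrelation 0.6 should need thinning by 2
rng = np.random.default_rng(1)
eps = rng.standard_normal(20000)
ar = np.empty(20000)
ar[0] = eps[0]
for t in range(1, 20000):
    ar[t] = 0.6 * ar[t - 1] + eps[t]
k, a_n, n_min = thinning_plan(ar)                 # lag-1 autocorr 0.6 -> 0.36 after thinning
```

For this chain the lag-1 autocorrelation is about 0.6, dropping to about 0.36 after thinning by k = 2, which yields a_n = 0.05 and a minimum of 1000 data points.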
Or follow L&G. L803: yes. In the online formulation we care about controlling the type II error (accepting samples from a non-stationary distribution) rather than the type I error, which makes it a non-standard setting and interesting future work.

Wild bootstrap parameter: second experiment a_n = 0.5 (data is iid); third a_n = 0.05; fourth a_n = 0.5. Tau-mixing: first experiment yes; the second is difficult to prove since the Markov kernel is non-stationary.

L068, L084 fixed. L071 fixed. Our mistake: the algorithm by G&M is essentially quadratic. Clarity: we will edit in accordance with the suggestions. L110 fixed. L275 yes. L282 yes, tau-mixing. L289: the citation is to Leucht; the null distribution is an infinite weighted sum of dependent chi-squared variables. L305 yes. L313 yes. L314: asymptotic power. L337 etc.: < \infty. Assumption v: yes, but we explicitly mention symmetry as a requirement of Leucht (we will point out that kernels satisfy it by definition). L798 yes.
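For concreteness, the role of the wild bootstrap parameter a_n discussed above can be sketched as follows: the auxiliary weight process is a {-1, +1} chain that flips sign with probability a_n at each step, so a_n = 0.5 recovers iid Rademacher signs (the iid-data case), while smaller a_n produces autocorrelated weights for dependent chains. This is our illustrative reading; the placeholder PSD matrix stands in for a precomputed Stein kernel matrix k_0 and is not from the paper.

```python
import numpy as np

def wild_bootstrap_weights(n, a_n, rng):
    """Auxiliary {-1,+1} process: flips sign with probability a_n per step.
    a_n = 0.5 gives iid signs; smaller a_n gives autocorrelated weights."""
    flips = rng.random(n) < a_n
    w = np.empty(n)
    w[0] = 1.0 if rng.random() < 0.5 else -1.0
    for t in range(1, n):
        w[t] = -w[t - 1] if flips[t] else w[t - 1]
    return w

def bootstrap_statistic(K0, a_n, rng):
    """One wild bootstrap replicate of the normalized V-statistic:
    (1/n) * sum_{i,j} W_i W_j K0[i, j]."""
    n = K0.shape[0]
    w = wild_bootstrap_weights(n, a_n, rng)
    return w @ K0 @ w / n

# usage: bootstrap a null threshold from a (placeholder) PSD kernel matrix
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 200))
K0 = A @ A.T / 200                    # stand-in for a precomputed Stein kernel matrix
null_draws = [bootstrap_statistic(K0, 0.5, rng) for _ in range(300)]
threshold = np.quantile(null_draws, 0.95)
```

The observed statistic would then be compared against `threshold` at level 0.05; only a_n changes between the iid and dependent settings listed above.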