Paper ID: 1177
Title: A Kernel Test of Goodness of Fit

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors use the multivariate Langevin Stein discrepancy of Gorham & Mackey, 2015 with an RKHS-based Stein set and the wild bootstrap of Shao, 2010 to produce a new one-sample goodness-of-fit test for \tau-mixing data with O(n^2) computational complexity, where n is the sample size. Under some very restrictive assumptions on the score functions of the densities being compared, they use RKHS theory, integration by parts, and the standard wild bootstrap results of Leucht, 2012 to verify that their statistic is asymptotically measure-determining with power 1. The authors apply their test to the evaluation of model fits on held-out data, the selection of a sufficient number of data points or random features for nonparametric density estimation, and the evaluation of approximate MCMC samples.

Clarity - Justification:
The paper on the whole is clearly written, but please note the following opportunities to improve clarity:
-L110: Here the statistic is called a U-statistic, but elsewhere it is called a V-statistic.
-Thm. 2.1: The multiple forward references in this section make it difficult to parse Theorem 2.1. Is it possible to introduce some of the definitions and assumptions earlier? For example, script P has not been defined.
-L275: Am I correct that Z and Z' in the expression Eh(Z,Z') are assumed to be independent?
-L282: Are any assumptions being made about the relationship amongst the Z_i's?
-L289: "has no computable closed form" Could a citation be provided for this claim?
-L305: "bootstrapped V-statistic" In what sense is B_n bootstrapped? Is each bootstrap replicate obtained by resampling the W's?
-L313: Does "V_n dominates B_n" mean that V_n is larger than B_n?
-L314: Is "almost sure" here being used in the sense of probability 1? Is the claim that this test always has power 1 (for finite sample size n)? If this is a claim about asymptotic power, that should be clarified for the reader.
-Sec. 2.2: Despite the warning at the start of Sec. 2, I did not realize in the midst of Sec. 2.2 that the claims being made would be established rigorously in the next section. It might be helpful to remind the reader that proofs of these statements will follow.
-L337, L341, L482, L509: Am I correct that by \leq \infty in these instances, the authors mean < \infty? (These assumptions are invoked with the quantities taken to be strictly less than infinity in Lemma 3.2 and Thm. 2.1.)
-Assump. (v): Does symmetry need to be explicitly assumed here? Does symmetry not follow from the definition of k being a kernel?
-Lem. 3.1: It may be clearer to repeat the definition of \xi(x,\cdot) here.
-L352: Could the statement of Lem. 5.1 be included in the text? I think this would be valuable for the reader.
-L798: Should 0.09 be 0.08?
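To pin down the reading behind several of the questions above (L275, L305, L313), here is a minimal numerical sketch of the statistic and its wild bootstrap as I understand them. I assume a Gaussian kernel, a target with computable score function, and a ±1 wild bootstrap process that flips sign with probability a_n at each step; the function names are mine, and the paper's exact process and normalizations may differ.

```python
import numpy as np

def stein_kernel(Z, score, sigma=1.0):
    """h(x, y): the Langevin Stein operator applied in both arguments of a
    Gaussian RKHS kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    score(Z) returns grad log p evaluated at each row of Z."""
    n, d = Z.shape
    S = score(Z)                                    # n x d matrix of scores
    diffs = Z[:, None, :] - Z[None, :, :]           # (x - y), shape n x n x d
    sqd = np.sum(diffs ** 2, axis=2)
    K = np.exp(-sqd / (2 * sigma ** 2))
    grad_x_K = -diffs / sigma ** 2 * K[:, :, None]  # grad_x k(x, y)
    grad_y_K = -grad_x_K                            # grad_y k(x, y)
    return ((S @ S.T) * K                                 # <s(x), s(y)> k
            + np.einsum('id,ijd->ij', S, grad_y_K)        # <s(x), grad_y k>
            + np.einsum('jd,ijd->ij', S, grad_x_K)        # <s(y), grad_x k>
            + K * (d / sigma ** 2 - sqd / sigma ** 4))    # tr(grad_x grad_y k)

def wild_bootstrap_test(Z, score, a_n=0.1, n_boot=1000, alpha=0.05):
    n = Z.shape[0]
    H = stein_kernel(Z, score)               # the O(n^2) cost is explicit here
    V_n = H.mean()                           # biased (V-) statistic
    B = np.empty(n_boot)
    for b in range(n_boot):                  # each replicate redraws the W's
        flips = np.random.rand(n) < a_n      # sign flip w.p. a_n per step
        W = np.where(np.cumsum(flips) % 2 == 0, 1.0, -1.0)
        B[b] = W @ H @ W / n ** 2            # bootstrapped V-statistic
    return V_n, np.quantile(B, 1 - alpha)    # reject if V_n exceeds threshold

# Example: i.i.d. N(0, 1) draws tested against a standard normal target,
# whose score is grad log p(z) = -z; a_n = 0.5 gives i.i.d. Rademacher W's.
Z = np.random.randn(200, 1)
V_n, threshold = wild_bootstrap_test(Z, score=lambda z: -z, a_n=0.5)
print(V_n, threshold)
```

If this matches the authors' construction, then (per my L305 question) each bootstrap replicate is indeed obtained by redrawing the W process while holding the data and the kernel matrix fixed.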
Significance - Justification:
-Does the paper contribute a major breakthrough or an incremental advance? The main contribution of this work is the combination of several threads of research (Gorham & Mackey's work on measuring sample quality via Stein discrepancies, the uncited work of Oates et al. on combining Stein operators and RKHS theory, and the wild bootstrap work of Shao and others for non-i.i.d. data) to produce a valid one-sample goodness-of-fit test for \tau-mixing data (under certain assumptions). While the work is more of an incremental advance than a major breakthrough, the contribution is both novel and significant.
-Are other people (practitioners or researchers) likely to use these ideas or build on them? Both practitioners and researchers are very likely to use and build upon these ideas. One potential stumbling block for practitioners is the need to set the wild bootstrap parameter a_n.
-Does the paper address a difficult problem in a better way than previous research? The problem of one-sample goodness-of-fit testing for \tau-mixing data has not been adequately addressed in the literature in its full generality. However, there are instances in which the authors could compare their proposal to prior work and currently do not. For example, in the GP experiment, how does the proposed test compare to the MMD approaches of Lloyd & Ghahramani? In the density estimation experiment, how do the realized power and size of the proposed test compare with those of standard goodness-of-fit tests?
-Is it clear how this work differs from previous contributions, and is related work adequately referenced? There is very relevant prior work in the control variates literature (see Oates et al., Control Functionals for Monte Carlo Integration, 2014 and the papers that reference it) that also applies the Langevin Stein operator to functions in an RKHS and should be cited. Some of the results in this submission were previously established by Oates et al., including Lem. 3.1, Lem. 3.3, and Lem. 5.1.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
-I have a few questions concerning technical details of the theory developed:
L380: "follows from assumption (i)" This statement does not follow from assumption (i), but it would follow from a corrected version of Assumption (ii) with \leq replaced by <.
Assumption (ii): Assuming that the \leq \infty here is meant to be a strict "less than \infty" (otherwise, Assumption (ii) appears to be vacuous), Assumption (ii) requires that E||grad log p(Z)||^2 < \infty for any random variable Z distributed according to any distribution q. This can only occur if ||grad log p(x)|| is uniformly bounded and is therefore a very restrictive assumption. The assumption, for instance, excludes all Gaussian and finite Gaussian mixture targets and most other targets of interest on unbounded domains. As a specific example, a standard Gaussian target p has unbounded grad log p(x) = -x, but E||grad log p(Z)||^2 = E||Z||^2 = \infty when Z is drawn from a Cauchy distribution q, since a Cauchy distribution does not have a second moment; hence Assumption (ii) is violated. If this restriction is intentional, it appears to be a major limitation of the scope of the theory developed and should be highlighted for the reader.
L409: The final line in the proof of Lemma 3.3 is incorrectly recorded. Should the final term be d/dx_i f(Z)? This identity is not explicitly justified in the text.
L509: Is the technical condition sum_t t^2 \sqrt{\tau(t)} < \infty equivalent to \tau-mixing? If not, what does it imply for the Markov chains satisfying this property, and what chains are ruled out?
Lem. 5.1: This statement is not correct at the level of generality stated. In particular, if the support of p has a non-negligible boundary (e.g., if p is a distribution on [0,1]), then E(Tf)(X) = f(1)p(1) - f(0)p(0), which need not be zero. It appears that p is implicitly assumed to be positive on all of R^d, but this assumption should be made explicit in the definition of the set P and in the statement of Lemma 5.1. Also, could the assumptions on F be reiterated in the statement of Lem. 5.1?
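To spell out the one-dimensional case behind my Lem. 5.1 comment (using the Langevin operator as I read the paper's definition, and taking supp(p) = [0,1]):

    (Tf)(x) = f'(x) + f(x) \frac{d}{dx} \log p(x) = \frac{(f p)'(x)}{p(x)},

so that

    \mathbb{E}_p[(Tf)(X)] = \int_0^1 (f p)'(x) \, dx = f(1)p(1) - f(0)p(0).

The boundary term vanishes when p is positive on all of R with suitable decay of f p at \pm\infty, but it need not vanish for compactly supported p, which is why I believe the positivity assumption should be made explicit.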
-Questions and comments on practical details and experimentation:
L642: Since the thinning rate and the setting of the wild bootstrap parameter are important for calibration, how would you recommend that a practitioner set these parameters in practice for a new problem?
L798: The text reads, "It is clear that ε = 0.09 yields a good approximation of the true stationary distribution," but this statement presumes that the proposed Stein discrepancy test is accurate. Could some external evaluation of the quality of the samples generated with ε = 0.08 be provided to demonstrate that 0.08 truly does offer a good approximation and that, say, 0.17 does not offer a good approximation?
L803: If one wanted to use the proposed test in an online fashion (alternating between gathering more training data and running the test based on the distribution fit to all training data), are there any multiple testing issues that one should be concerned about?
Higher dimensions: It appears that all of the experiments involved either one- or two-dimensional targets. Do you have evidence that the proposed test is useful for problems with higher-dimensional targets? For what range of dimensions would you recommend its use?
Wild bootstrap parameter: How was a_n set in each experiment? (I believe it is only reported for the first experiment.)
Tau-mixing: Are the Markov chains studied in the experiments \tau-mixing?

-Questions and comments on accuracy of claims:
L068 (and footnote 1): "which results from applying the Stein operator to the bounded Lipschitz functions" I do not believe that this is an accurate characterization of the recommended discrepancies of Gorham & Mackey, 2015. While it is true that bounded Lipschitz functions give rise to the Wasserstein metric (when they are not passed through a Stein operator), the "Stein sets" G explored by Gorham & Mackey are passed through a Stein operator and do not coincide with the class of 1-bounded, 1-Lipschitz functions. For example, the Gorham & Mackey paper describes the classical Stein set G as the set of 1-bounded, 1-Lipschitz functions with 1-Lipschitz gradients (that is, functions belonging to the unit ball of the W^{2,\infty} Sobolev space). This is a smaller class than the class of 1-bounded, 1-Lipschitz functions.
L084: "Wasserstein-based Stein discrepancy" For the same reasons given in the L068 comment above, it might be more appropriate to call the Stein discrepancies of the Gorham & Mackey paper $W^{2,\infty}$-Sobolev Stein discrepancies, since they do not appear to be based on the Wasserstein distance but rather are shown to dominate the Wasserstein distance.
L071: "expensive linear program" Have the authors compared the computational expense of the linear program to that of the tests proposed? I have been experimenting with the publicly released Stein discrepancy code of Gorham & Mackey (https://github.com/jgorham/stein_discrepancy), and I have found that, empirically, the multivariate programs have quadratic complexity in the sample size, which is the same complexity as computing a V-statistic.

-Are the authors careful (and honest) about evaluating both the strengths and weaknesses of the work? [See response to authors' rebuttal around these concerns.] I am concerned that Assumption (ii) [E||grad log p(Z)||^2 < \infty for Z ~ q and any q] severely limits the applicability of the theory and should be highlighted for the reader; the calculation below makes the divergence in my Gaussian/Cauchy example explicit.
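For concreteness, the Gaussian/Cauchy example from my detailed comments amounts to the following calculation, with p the standard normal density (so \nabla \log p(z) = -z) and q the standard Cauchy density:

    \mathbb{E}_q \|\nabla \log p(Z)\|^2 = \mathbb{E}_q[Z^2] = \int_{-\infty}^{\infty} \frac{z^2}{\pi (1 + z^2)} \, dz = \infty,

since the integrand tends to 1/\pi as |z| \to \infty. Assumption (ii) therefore fails for this (p, q) pair even though the target p is as well behaved as one could hope.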
I am also concerned by the need to set the wild bootstrap parameter a_n, and I believe the paper would benefit from more guidance on this matter. Finally, I am left wondering about the behavior of this test in dimensions higher than 2.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a discrepancy statistic for measuring differences between two probability distributions. More specifically, the measure is based on the Stein identity, leveraging the computational advantages offered by RKHS theory. Furthermore, applications to goodness-of-fit testing and related problems are provided.

Clarity - Justification:
The paper is very clearly written.

Significance - Justification:
There is no complete theoretical understanding of the problem, and this looks like yet another test.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper proposes a discrepancy statistic for measuring differences between two probability distributions. More specifically, the measure is based on the Stein identity, leveraging the computational advantages offered by RKHS theory. Furthermore, applications to goodness-of-fit testing and related problems are provided.

From the point of view of goodness-of-fit testing, the authors claim that for the MMD-based test, computing the expectation with respect to the known distribution (say P) is difficult, whereas for the Stein-based statistic this expectation evaluates to zero. But for the MMD, since the distribution P is known, one can always center the kernel and make it degenerate with respect to P, which overcomes this problem (see the sketch following this review). A comparison and discussion of this issue would be appreciated.

The idea is straightforward, and every couple of years there is a kernel-based test in ICML/NIPS which performs better than the previous year's test (as shown by experiments). It is not clear under what conditions this test would perform better or how to construct an optimal test. A precise and complete theoretical understanding of the problem is lacking in the paper.
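Concretely, the centering I have in mind is the following (a sketch; whether the required expectations under P are available in closed form or must themselves be estimated is exactly the point that deserves discussion):

    \tilde{k}(x, y) = k(x, y) - \mathbb{E}_{w \sim P}\, k(x, w) - \mathbb{E}_{w \sim P}\, k(w, y) + \mathbb{E}_{w, w' \sim P}\, k(w, w'),

so that \mathbb{E}_{Z \sim P}\, \tilde{k}(Z, y) = 0 for every y. The MMD V-statistic built from \tilde{k} is then degenerate under the null P, just as the Stein-modified kernel is.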
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The problem is important and the methodology is novel. The authors propose a goodness-of-fit measure by introducing Stein's method and cleverly choosing functions from a Reproducing Kernel Hilbert Space. The authors define the statistic as the maximum discrepancy between the empirical sample expectations and the target expectations. The constructed statistic is essentially a V-statistic (the authors use a biased version), and one can take advantage of the wild bootstrap to approximate the empirical distribution of the statistic. The proposed statistic can have a wide range of applications. For example, the statistic can be applied to test whether MCMC samples are drawn from the target stationary distribution, which is interesting.

Clarity - Justification:
Overall, the paper is clearly written. It is enjoyable to read.

Significance - Justification:
The key contribution of this paper is that the authors cleverly select the functions from a Reproducing Kernel Hilbert Space when applying Stein's method to construct the goodness-of-fit statistic, which can be used to detect whether a set of data is generated from the target distribution. This is an improvement over the state-of-the-art method proposed by Gorham & Mackey (2015). Stein's method is a useful tool for characterizing convergence in distribution. Gorham & Mackey extend the use of Stein's method to quantify the quality of samples from MCMC. However, in their method, the Stein discrepancy is obtained via an optimization problem that has no closed form. Compared to Gorham & Mackey's method, this paper's approach is more computationally tractable, easier to implement, and has a closed form.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. The derivations of the Stein discrepancy in a Reproducing Kernel Hilbert Space seem reasonable. The constructed statistic is essentially a V-statistic and therefore can be evaluated by many existing tools. For example, one can use the (dependent) wild bootstrap to approximate the empirical distribution of the statistic under i.i.d. or stationary, weakly dependent sampling.
2. The experiments seem interesting. The authors consider a couple of possible applications. A minor concern is that in some examples, I am not sure whether the dependent wild bootstrap is valid for approximating the empirical distribution of the squared Stein discrepancy. For example, in Shao's paper (2010), he assumes that the samples are stationary and weakly dependent to guarantee consistency of the approximation. And in Leucht & Neumann's 2013 paper, they extend Shao's dependent wild bootstrap method to U- and V-statistics, proving consistency under the stricter assumption that the samples are strictly stationary and \tau-mixing. However, for MCMC, the sequence of samples transitions gradually from non-stationary to stationary. I am not sure whether it is valid to construct the empirical distribution of the squared Stein discrepancy from samples that are still non-stationary. The authors should check these conditions more carefully.
3. Overall, the paper is clearly written. It is enjoyable to read.
=====