Paper ID: 142
Title: A Kernelized Stein Discrepancy for Goodness-of-fit Tests

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper develops a kernel goodness-of-fit test by extending results of (Gorham & Mackey, 2015).

Clarity - Justification:
The paper is easy to read, and concepts are introduced clearly.

Significance - Justification:
There are many applications of the test.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
How do the kernel parameters (e.g., the bandwidth) affect the size and power of the test? How does the power depend on the dimension of the problem? It would be useful to provide a bound or a simulation illustrating these effects.
Can the procedure be used to evaluate draws from an MCMC scheme? The current setup assumes i.i.d. observations, whereas MCMC draws would be weakly dependent.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors use U-statistics and the multivariate Langevin Stein discrepancy introduced in (Gorham & Mackey, 2015), with an RKHS-based Stein set, to produce a new one-sample goodness-of-fit test for i.i.d. data with O(n^2) computational complexity, where n is the sample size. Under an assumption on score-function differences, they show that their statistic is asymptotically measure-determining for distributions with smooth densities, and they characterize the asymptotic distribution of a scaled statistic under the null and alternative hypotheses. Because the CDF of the asymptotic null distribution is difficult to work with, the authors employ the bootstrap method of (Huskova & Janssen, 1993; Arcones & Gine, 1992) to obtain a consistent estimate (under the same score-function assumption) of the asymptotically optimal threshold for their statistic. The statistic and the bootstrap-derived threshold together define the proposed test. The authors show that the test performs comparably to a number of standard goodness-of-fit tests in a simulated univariate example with a Gaussian-mixture target, and comparably to a Monte-Carlo-based two-sample test in a simulated multivariate Gaussian-Bernoulli RBM example.

Clarity - Justification:
The paper on the whole is very clearly written, but there are a few instances in which clarity could be improved:
- L077: The phrase "general smooth functions" should be clarified.
- L129: "efficient empirical estimation": Is the word "efficient" being used in the statistical sense of efficiency, or in the computational sense?
- L184: "positive definite kernel" is used here instead of "positive semi-definite," which is used elsewhere.
- L198: Does the parenthetical "(smooth)" in this sentence mean that "smooth" is being defined to mean continuously differentiable in this paper? If so, it would be clearer to make this definition explicit (or to use standard notation like C^1); otherwise, it would be clearer to remove the parenthetical.
- L224: Since the domain is R^d, you may want to write ||x|| \to \infty instead of x \to \infty.
- Eq. 6: I believe the final x should be an x'.
- L226: Since the phrase is used on multiple occasions, it would be helpful to add a definition of a "regular density."
- L607: The authors suggest that the Fisher divergence has never been used for model selection, but this seems like a special case of the standard use of the Fisher divergence (in the context of score matching) for selecting the best model from a class of models parameterized in some manner.
Could the authors clarify how this application differs from the standard one?
- Sec. 5.2: Can the KSD be viewed as an instance of MMD with a q-specific choice of RKHS?
- Fig. 2: It is unclear from the display what the y-axes are measuring in Figs. 2c-h, and the reader may be tempted to assume that they are also measuring error rate.

Significance - Justification:
- Does the paper contribute a major breakthrough or an incremental advance?
This paper is somewhat incremental in that the combination of Stein operators and RKHS theory has precedent in the uncited paper (Oates et al., Control Functionals for Monte Carlo Integration), and in that the use of computable Stein discrepancies to measure sample quality, and the use of the multivariate Langevin Stein discrepancy in particular, stem from (Gorham & Mackey, 2015). However, the main novelty of the paper comes from applying these ideas to the one-sample goodness-of-fit testing setting, analyzing the asymptotic behavior of the resulting test statistic, and developing a consistent bootstrap-based choice of test threshold, and this novelty is significant.
- Are other people (practitioners or researchers) likely to use these ideas or build on them?
I believe both practitioners and researchers are very likely to use and build upon these ideas.
- Does the paper address a difficult problem in a better way than previous research?
The paper certainly addresses a difficult problem (goodness-of-fit testing for multivariate distributions), but it is not clear whether this approach provides a better solution than existing multivariate one-sample goodness-of-fit tests (like the multivariate Cramer-von-Mises test and the multivariate Kolmogorov-Smirnov test studied in (Chiu & Liu, Generalized Cramér-von Mises goodness-of-fit tests for multivariate distributions)), since the authors do not compare with these tests in their experiments.
- Is it clear how this work differs from previous contributions, and is related work adequately referenced?
There are several places where references to related work could be improved:
- L011: In several sections, the authors state that their work is based on "a novel combination of Stein's method and the reproducing kernel Hilbert space theory." However, the uncited prior work of (Oates et al., Control Functionals for Monte Carlo Integration, 2014) also explores the combination of the multivariate Langevin Stein operator, Stein's identity, and RKHS theory, and this work is subsequently used in several papers by Chris Oates and collaborators (under the name "control functionals"). For example, Thm. 3.5 of this paper is a consequence of Thm. 1 of (Oates et al.), and Prop. 3.4 of this paper is discussed in the prose following Lem. 1 of (Oates et al.).
- Sec. 5.2: MMD is not only useful for two-sample tests. It can also be used for a one-sample test if the expected kernel function can be computed in closed form under q. Since both the 1D Gaussian mixture model and the Gaussian-Bernoulli RBM are instances of Gaussian mixtures, a one-sample MMD test with a Gaussian RBF kernel could also be carried out for each experiment. It would be illuminating to compare the proposed kernelized Stein discrepancy with this prior art.
- Eq. 11: This formulation of the kernelized Stein discrepancy is an instance of the multivariate Langevin Stein discrepancy S(P, T_Q, G) introduced in (Gorham & Mackey, 2015), where (T_Q g)(x) = <g(x), grad log q(x)> + <grad, g(x)> is the multivariate Langevin Stein operator. In this case, the Stein set G is the unit ball of H^d.
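For concreteness, the statistic summarized above can be computed in closed form. Below is a minimal sketch of the O(n^2) U-statistic, assuming a Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 h^2)) and the standard Stein-kernel expansion u_q(x, x') = s_q(x)^T k(x, x') s_q(x') + s_q(x)^T grad_{x'} k(x, x') + s_q(x')^T grad_x k(x, x') + trace(grad_x grad_{x'} k(x, x')); the function and argument names (ksd_u_statistic, score_q, h) are illustrative, not from the paper.

```python
import numpy as np

def ksd_u_statistic(X, score_q, h=1.0):
    """U-statistic estimate of the kernelized Stein discrepancy S(p, q).

    X       : (n, d) array of i.i.d. samples from the unknown distribution p.
    score_q : callable mapping an (n, d) array to the (n, d) array of scores
              grad_x log q(x); the normalizing constant of q is not needed.
    h       : bandwidth of the Gaussian RBF kernel.
    """
    n, d = X.shape
    S = score_q(X)                            # (n, d) score evaluations
    diff = X[:, None, :] - X[None, :, :]      # (n, n, d) pairwise x_i - x_j
    sqdist = np.sum(diff ** 2, axis=-1)       # (n, n) squared distances
    K = np.exp(-sqdist / (2 * h ** 2))        # (n, n) kernel matrix

    term1 = (S @ S.T) * K                                    # s_q(x)^T k s_q(x')
    term2 = np.einsum('id,ijd->ij', S, diff) / h ** 2 * K    # grad_{x'} k = (x - x') k / h^2
    term3 = -np.einsum('jd,ijd->ij', S, diff) / h ** 2 * K   # grad_x k = -(x - x') k / h^2
    term4 = (d / h ** 2 - sqdist / h ** 4) * K               # trace of the mixed second derivative
    U = term1 + term2 + term3 + term4         # (n, n) matrix of u_q(x_i, x_j)

    np.fill_diagonal(U, 0.0)                  # the U-statistic excludes i = j
    return U.sum() / (n * (n - 1)), U
```

Returning the matrix of u_q evaluations alongside the scalar makes it easy to reuse in the weighted-bootstrap sketch below.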
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Comments and questions concerning technical correctness:
-- Assuming (s_q - s_p) \in H^d: The theoretical justification of the kernelized Stein discrepancy seems to hinge on the final statement of Prop. 3.2: if (s_q - s_p) is in H^d, then S(p, q) > 0 whenever p != q. However, the assumption that (s_q - s_p) is in H^d is very restrictive and does not hold for most examples of interest. For example, if H is a Gaussian RBF RKHS (as recommended and used in the paper), then all functions in the RKHS decay to zero as ||x|| goes to infinity (i.e., they are in C_0), but s_q - s_p quite often does not decay to zero as ||x|| goes to infinity. A simple example comes from comparing two Gaussians with different variances: if p and q are univariate mean-zero Gaussians with variances 2 and 1 respectively, then s_q(x) - s_p(x) = -x - (-x/2) = -x/2, an unbounded function which is therefore not in the Gaussian RKHS. The authors suggest in the prose following the proposition that, to ensure S(p, q) > 0 whenever p != q, it is sufficient to select an RKHS that is dense in the set of continuous functions on compact subsets of R^d under the L^\infty norm, but I do not see why L^\infty-norm density on compact subsets is sufficient to guarantee that S(p, q) > 0 whenever p != q. If the authors can augment Prop. 3.2 with a precise proof that a universal kernel such as a Gaussian RBF guarantees the S(p, q) > 0 whenever p != q property even when s_q - s_p \notin H^d, this would be a significant and important strengthening of the result. As currently established, Prop. 3.2 with a Gaussian RBF only covers examples in which s_q - s_p is a vanishing function, which is seldom the case when dealing with unbounded distributions. Note that the same assumption is made in Thm. 4.1 (to establish the asymptotic properties of the U-statistic), in Thm. 5.1 (to relate S(p, q) to the Fisher divergence), in Prop. 4.2 (to analyze asymptotic power), and in Thm. 4.3 (to establish asymptotic correctness of the bootstrap approximation).
-- L077: The cited result (Prop. 2.4 of Stein et al., 2004) does not consider a set of smooth functions f but rather a set of regular functions f (which include indicator functions and may be discontinuous).
-- Eq. 1: This identity does not necessarily hold for densities supported on only a subset of the real line (like the Uniform[0,1] distribution).
-- L295: In the proof of Prop. 3.2, the claim is that S(p, q) = \sum_l \sum_j \lambda_j \alpha_{jl}^2, but I believe this is only true if the e_j(x) are eigenfunctions satisfying \lambda_j e_j(x) = \int k(x, x') e_j(x') p(x') dx' (i.e., eigenfunctions in L^2(p)). However, Sec. 2.1 suggests that the e_j(x) are instead eigenfunctions satisfying \lambda_j e_j(x) = \int k(x, x') e_j(x') dx' (i.e., eigenfunctions in L^2 under Lebesgue measure).
- Comments and questions concerning empirical evaluation:
-- How many bootstrap replicates were used for KSD-U in your experiments, and what recommendations could you give to practitioners for setting the number of replicates? (A sketch of one standard weighted-bootstrap scheme appears below.)
-- My main concern with the experimental evaluation is that the authors do not compare to pre-existing (non-Monte-Carlo-based) multivariate goodness-of-fit tests like the multivariate Cramer-von-Mises test and the multivariate Kolmogorov-Smirnov test studied in (Chiu & Liu, Generalized Cramér-von Mises goodness-of-fit tests for multivariate distributions).
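To make the bootstrap question concrete, here is a minimal sketch of the multinomial weighted bootstrap of (Huskova & Janssen, 1993; Arcones & Gine, 1992) for a degenerate U-statistic, reusing the matrix returned by ksd_u_statistic above; the default replicate count and helper names are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def ksd_bootstrap_test(U, alpha=0.05, n_boot=1000, rng=None):
    """Weighted-bootstrap threshold for the degenerate KSD U-statistic.

    U : (n, n) matrix of u_q(x_i, x_j) values with a zeroed diagonal,
        as returned by ksd_u_statistic above.
    """
    rng = np.random.default_rng(rng)
    n = U.shape[0]
    stat = U.sum() / (n * (n - 1))            # the KSD U-statistic

    boot = np.empty(n_boot)
    for b in range(n_boot):
        # Multinomial weights centred at 1/n (Huskova & Janssen style).
        w = rng.multinomial(n, np.full(n, 1.0 / n)) / n - 1.0 / n
        boot[b] = w @ U @ w                   # diagonal of U is zero, so i != j only
    threshold = np.quantile(boot, 1.0 - alpha)
    return stat, threshold, stat > threshold  # reject H0 if stat > threshold
```

On the replicate-count question, a few hundred to a few thousand replicates is a common choice for resolving a 0.05-level tail quantile, but a principled recommendation from the authors would still be valuable.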
-- Would it be possible for the authors to decompose the error rate (into the FP rate and the FN rate), so that the reader can evaluate whether the empirical size of the test matches the desired 0.05 level and what power the test achieves?
-- Why is LR-AIS not included in the comparison in Fig. 2a? If a test was not constructed, how do the authors know that AIS performed reasonably well, as stated on line 860?
- Comments on terminology:
-- L076: Typically the term "Stein's identity" (also known as Stein's lemma) is reserved for the "if" part of this statement: that is, if q = p, then we have the identity E_q[s_q(x) f(x) + grad_x f(x)] = 0.
-- L014: The authors describe their contribution as a combination of Stein's method and RKHS theory. However, as the authors also note, Stein's method is a tool for bounding integral probability metrics in terms of a Stein discrepancy, by rewriting test functions using the Stein operator underlying Stein's identity. While the authors do use a Stein discrepancy as the basis for their proposed procedures and Stein's identity (integration by parts) to help characterize those procedures, they do not develop Stein's method for the purpose of bounding an integral probability metric (and, indeed, Stein's method has not been developed for most of the multivariate cases of interest in this paper). Since there are important technical distinctions between Stein's method, Stein discrepancies, and Stein's identity, I believe it is inaccurate to characterize the work as "a combination of Stein's method and ...," but it would be very accurate to describe it instead as a combination of Stein discrepancies (or Stein's identity) and RKHS theory.
-- L097: "turning the optimization into quadratic programming": The recommended graph Stein discrepancy procedure described in (Gorham & Mackey, 2015) is a linear program, not a quadratic program.
-- Title: The title of the paper is "A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation." However, the paper seems entirely focused on goodness-of-fit testing, with no separate discussion of or experimentation with the task of model evaluation. I wonder if the shortened title "A Kernelized Stein Discrepancy for Goodness-of-fit Tests" would be more appropriate.
- Is this a complete piece of work, or merely a position paper?
This is a complete piece of work and not merely a position paper.
- Are the authors careful (and honest) about evaluating both the strengths and weaknesses of the work? [Update: see response to author feedback.]
I am concerned that the assumption (s_q - s_p) \in H^d is a major restriction on the applicability of the theoretical results, and I hope that the authors can resolve this by either highlighting the restriction or carefully proving that the assumption can be circumvented.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a discrepancy statistic for measuring differences between two probability distributions. More specifically, the measure is based on Stein's identity, leveraged for the computational advantages offered by RKHS theory. Furthermore, applications to goodness-of-fit testing and related applications are provided.

Clarity - Justification:
The paper is well written.

Significance - Justification:
There is no complete theoretical understanding of the problem, and this looks like yet another test.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper proposes a discrepancy statistic for measuring differences between two probability distributions. More specifically, the measure is based on Stein's identity, leveraged for the computational advantages offered by RKHS theory. Furthermore, applications to goodness-of-fit testing and related applications are provided.
From the point of view of goodness-of-fit testing, the authors claim that for the MMD-based test, computing the expectation with respect to the known distribution (say P) is difficult, and that for the Stein-based statistic, it evaluates to zero. But for the MMD, since the distribution P is known, one can always center the kernel, \tilde{k}(x, x') = k(x, x') - E_{y ~ P}[k(x, y)] - E_{y ~ P}[k(x', y)] + E_{y, y' ~ P}[k(y, y')], and thereby make it degenerate with respect to P, which overcomes this problem. A comparison to and discussion of this issue would be appreciated.
The idea is straightforward, and every couple of years there is a kernel-based test at ICML/NIPS that performs better than the previous year's test (as shown by experiments). It is not clear under what conditions this test would perform better, or how to construct an optimal test. A precise and complete theoretical understanding of the problem is lacking in the paper.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The problem is important and interesting. The authors propose a goodness-of-fit measure called the Kernelized Stein Discrepancy (KSD) to quantify the difference between two probability distributions. They construct the statistic on the basis of Stein's method and reproducing kernel Hilbert space theory. Their method is likelihood-free: if the target distribution is q(x), only the score function of q(x) needs to be evaluated, which is computationally efficient. The proposed statistic can also be applied to high-dimensional models where traditional goodness-of-fit tests would fail.

Clarity - Justification:
Overall the paper is well written. It is not hard to follow.

Significance - Justification:
The authors use Stein's method for comparing distributions by choosing a set of test functions. A similar idea has been explored recently in (Gorham & Mackey, 2015). The novelty here is to restrict the test functions to lie in an RKHS, so that the optimization problem of finding the worst-case function can be solved in closed form. Given finite samples, this leads to simple U-statistics for which many existing results can be used. Although this approach generalizes the Fisher divergence, it is not clear what the advantage of the current approach is over the Fisher divergence.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. Probably one of the main limitations of this paper is that the authors only consider i.i.d. samples. This idealized assumption will limit the applicability of the proposed statistic. Properties of U- and V-statistics under stationarity and mixing conditions have been well studied (e.g., Denker & Keller, 1983; Beutner & Zahle, 2012). It would be better if the authors could also take into account some non-i.i.d. cases.
2. The derivations are rigorous and self-contained. The main results are shown in Theorem 3.5, which describes how to compute the proposed KSD. It seems easy to implement. When approximating the empirical distribution of the proposed statistic (essentially a degenerate U-statistic under the null hypothesis), the authors claim that a lot of existing work can be helpful, which is reasonable.
3. The authors also evaluate some properties of their proposed statistic. For example, they establish connections between the proposed KSD and traditional statistical measures, such as the Fisher divergence and the MMD, which is interesting.
4. If the experiments were better designed, the results would probably be more convincing. In the current examples, the authors fix the significance level and compute the error rates (power) for varying signal strengths. However, to comprehensively evaluate a statistic, one more often needs to assess its false-alarm rate as well as its power, or equivalently its type I and type II errors. Therefore, something like an ROC curve (or its AUC) would probably be better than the error rate alone.

=====
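The size/power diagnostics requested in Review #1 and in Review #4's point 4 are straightforward to simulate. Below is a minimal Monte Carlo sketch that estimates the rejection rate of the bootstrap test, reusing ksd_u_statistic and ksd_bootstrap_test from the sketches above; the sampler interface and all parameter values are illustrative assumptions, not the authors' experimental setup.

```python
import numpy as np

def empirical_rejection_rate(sample_p, score_q, n=200, h=1.0,
                             alpha=0.05, n_trials=200, seed=0):
    """Monte Carlo estimate of the rejection rate of the bootstrap KSD test.

    sample_p : callable(rng, n) -> (n, d) draws from the data distribution p.
    score_q  : score function of the null model q, as in ksd_u_statistic.
    With p = q, the result estimates the type I error (ideally close to
    alpha); with p != q, one minus the result is the type II error.
    Sweeping h and n traces out the bandwidth and dimension effects that
    Review #1 asks about.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        X = sample_p(rng, n)
        _, U = ksd_u_statistic(X, score_q, h=h)
        _, _, reject = ksd_bootstrap_test(U, alpha=alpha, rng=rng)
        rejections += bool(reject)
    return rejections / n_trials

# Example: null model q = N(0, 1), so score_q(X) = -X. Drawing data from
# N(0, 2) exercises the unbounded score-difference example from Review #2.
size = empirical_rejection_rate(lambda r, n: r.normal(0.0, 1.0, (n, 1)),
                                lambda X: -X)
power = empirical_rejection_rate(lambda r, n: r.normal(0.0, np.sqrt(2.0), (n, 1)),
                                 lambda X: -X)
```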