Paper ID: 980 Title: Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): Data samples may contain highly sensitive information about their subjects, so the privacy of individuals can be compromised when results are released. The authors develop new techniques to support privacy-preserving data analysis by constructing hypothesis tests that are differentially private and give conclusions similar to standard, non-private tests for categorical data. To establish this claim, the authors construct four differentially private tests based on the chi-squared statistic that achieve the desired significance level at the cost of some loss of power.

Clarity - Justification: The paper is well written with minimal typos. Overall we find it, including the technical points, to be mostly clear. We address detailed issues below.

Significance - Justification: The proposed methods are very useful: when differential privacy is applied to a dataset, the loss of utility is a concern, and the methods in this paper allow hypothesis testing on a private dataset to determine whether the data come from a certain distribution and whether two sequences of data are independent. In addition, their methods provide an interface on top of the differential privacy scheme, so that one can apply established results from hypothesis testing to reveal statistical properties of a dataset without compromising privacy. However, while the paper establishes methods for testing in the privacy setup, we have many concerns, which we raise in 6. Detailed comments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): A very major issue: Overall, this paper contains meaningful results and is written in a clear and logical manner. However, one issue in this paper is the lack of theoretical proofs.
In each proof, the authors add Gaussian noise to the original dataset. Is this out of convenience? Since the authors focus on chi-squared statistics in this paper, Gaussian noise is more compatible with the asymptotic analysis, i.e., they picked an easy condition. How difficult are extensions? This seems quite important to discuss in future work. It may be much more difficult to adapt Laplace noise to their proposed scheme; however, without addressing this head on, it is not clear. The authors only provide empirical results for Laplace noise in this paper, and the paper would be much stronger with a theoretical proof for Laplace noise.

Other major comments/concerns:

1. From the simulation study (Figure 4), it is evident that there is a substantial loss in power even for fairly reasonable sample sizes. Given this, what is the threshold sample size at which this power can be controlled? The practical applicability of the method is not clearly justified, nor are the computational costs, such as the computational complexity of the problem.

2. Moreover, simulation studies for independence are provided only for 2-by-2 contingency tables. How do the authors' methods apply to k-by-k contingency tables, and what is the computational complexity in that case?

3. One major shortcoming is that the methods have not been applied to any real dataset. Could the authors speak to why this is the case? It seems that some datasets are publicly available.

Overall, the paper tries to establish nice theoretical properties concerning private chi-squared testing, but the main concern is its practical applicability and computational cost compared to ordinary tests. This seems to be a major issue that is not addressed head on by the authors.

Minor:

1. Please note that the citation reference formats are not proper. For example, in Lemmas 6.1, 6.2, and 6.3 there are multiple brackets.

2. Plots: the characters in these are very hard to read.
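As a reference point for the noise-addition question raised above: the scheme under discussion (add Gaussian noise to the counts, then calibrate the rejection threshold by Monte Carlo simulation under the null) could be sketched roughly as follows. This is an illustrative Python/NumPy sketch with hypothetical helper names, not the authors' actual code.

```python
import numpy as np

def noisy_gof_statistic(counts, probs, sigma, rng):
    # Chi-squared goodness-of-fit statistic computed on counts
    # perturbed with Gaussian noise of scale sigma.
    noisy = counts + rng.normal(0.0, sigma, size=len(counts))
    expected = counts.sum() * np.asarray(probs)  # expected counts under H0
    return float(np.sum((noisy - expected) ** 2 / expected))

def monte_carlo_threshold(n, probs, sigma, alpha=0.05, trials=2000, seed=0):
    # Estimate the (1 - alpha) critical value of the noisy statistic
    # under the null by simulation, in the spirit of a Monte Carlo
    # goodness-of-fit test.
    rng = np.random.default_rng(seed)
    stats = [
        noisy_gof_statistic(rng.multinomial(n, probs).astype(float),
                            probs, sigma, rng)
        for _ in range(trials)
    ]
    return float(np.quantile(stats, 1.0 - alpha))
```

With sigma = 0 this recovers approximately the usual chi-squared critical value (about 7.81 for four categories at alpha = 0.05); a larger sigma inflates the threshold, which is one way to see where the loss of power comes from.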
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper extends classical, widely used statistical tests (the goodness-of-fit test and the independence test for multinomial distributions) to allow privacy protection under the formalism of differential privacy. The results are correct to the best of my ability to check, and the work goes beyond the standard approach of typical differential privacy methods, where one adds noise and then sticks with the noisy answer. The proposed method involves additional post-processing, a statistical inference step that takes into consideration the distribution and the additional uncertainty induced by differential privacy, which intuitively should help the method achieve more accurate \alpha-level control and better power. Indeed, the experimental results show pretty substantial improvements. In addition, the paper is extremely well written and could serve as a very good introduction for machine learning researchers who know only a little about hypothesis testing. Overall, I think the paper is technically sound, reasonably novel, and clearly successful in addressing an important problem relevant to analyzing sensitive datasets (GWAS and other medical studies).

Clarity - Justification: Extremely well written and easy to follow.

Significance - Justification: It solves an important, practically highly relevant problem in the class of private hypothesis testing. In that regard the contributions are very significant. I wouldn't call it a major breakthrough, though.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Detailed comments are below:

1. Given the obviously binary outcome (Reject or Fail to Reject), the gold standard is perhaps just to run the exponential mechanism (with the score function being the power) on these two outcomes? It would be nice if the authors could discuss the difficulties of this approach.
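For reference, the exponential-mechanism baseline suggested in point 1 could be sketched as follows. This is an illustrative Python sketch; the outcome names, the score table, and the sensitivity bound are assumptions for the example, not taken from the paper.

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity, rng=None):
    # Exponential mechanism over a small set of outcomes, e.g.
    # {"reject", "fail_to_reject"}, scored by (estimated) power.
    # Each outcome is sampled with probability proportional to
    # exp(epsilon * score / (2 * sensitivity)).
    rng = rng or random.Random(0)
    outcomes = list(scores)
    weights = [math.exp(epsilon * scores[o] / (2.0 * sensitivity))
               for o in outcomes]
    r = rng.random() * sum(weights)
    acc = 0.0
    for outcome, w in zip(outcomes, weights):
        acc += w
        if r <= acc:
            return outcome
    return outcomes[-1]
```

With a generous privacy budget the higher-scoring outcome is returned with overwhelming probability; the crux is the sensitivity bound, i.e., how much the power score can change when a single record changes.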
I would imagine the power of these tests should be pretty insensitive w.r.t. changing just one data point.

2. Along the line of thought in point 1, suppose it is not the binary test outcome that we need to release, but rather the p-value. This is a more challenging problem, but if p-values are insensitive, then potentially we can release them rather accurately. In fact, we do not care about large p-values, so the "sparse vector technique" or "propose-and-test" might be of interest, to further boost the power of the corresponding tests that people can run using this private p-value. Of course, such a test can be further customized by taking into consideration the known uncertainty induced by the private release of the p-value.

3. This is more for future work, but I think the paper provides a nice framework for analyzing nonparametric two-sample tests and independence tests based on Maximum Mean Discrepancy (e.g., the kernel two-sample test, the Hilbert-Schmidt Independence Criterion).

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper proposes several differentially private hypothesis tests and independence tests for multinomial distributions. It derives the adjustments to the threshold required to meet the 1-\alpha significance level given the Laplace and Gaussian noise added to the counts.

Clarity - Justification: The paper is written and organized properly. Since the proofs are key contributions of this paper and are relatively short, they could have been partially presented in the main paper instead of the appendix. There are several places in the paper that could use more explanation.

Significance - Justification: Differentially private hypothesis and independence tests for multinomial distributions are important problems, and the paper suggests new algorithms that achieve the desired significance (1-\alpha), accompanied by an inevitable loss of power.
The reviewer is not an expert, but the key questions about the performance of the algorithms are unclear from the text (see detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper presents several algorithms and compares them empirically, but the reviewer cannot form a clear picture of which algorithm is better, in what respect, and under what conditions. Interpretations of the experimental results are missing and left for the reader to figure out. In particular, should correct significance be prioritized over power? What can we say about the significance-power trade-off of these algorithms, either analytically or empirically? How good are these algorithms compared to the previous methods cited in lines 134 and 143? In line 767 and Fig 1, why doesn't MCGOF need testing? In Fig 2, for small n, what does an empirical significance of 1 (instead of 0.95) indicate? Does it mean the algorithms will further require a small-sample adjustment to the threshold? In lines 798-805 and Fig 3, what is the critical value?

=====