Paper ID: 1101
Title: Discrete Distribution Estimation under Local Privacy

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper deals with the problem of estimating discrete probability distributions under the local differential privacy model. In this model each sample is generated by a user who, in order to preserve her privacy, will only send a randomized version of her sample to the entity interested in estimating the underlying distribution. The paper studies two aspects of already existing methods: the expected risk as a function of the parameters of the problem (e.g., support of the distribution, number of samples, and privacy parameter), and extensions of the methods to the setting where the support is not known a priori. An interesting empirical evaluation is presented, where the optimality of the different methods is compared across a wide range of support sizes and privacy parameters. This evaluation also explores the practically important problem of decoding, i.e., how to obtain a proper distribution from the solution of a system of linear equations (which may contain negative numbers).

Clarity - Justification:

In general the paper is well written, but some of the points are not entirely clear (see comments below).

Significance - Justification:

This is an important and timely topic, and the presented results, though not ground-breaking, are quite interesting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

1/ It is not obvious why the optimal decoding depends on the shape of the underlying distribution. Is there any theoretical result that can explain this phenomenon or identify parameters that control this behavior (e.g., the entropy)?

2/ Section 5.1.2 says that in order to perform the decoding of an O-RR estimator one needs to fix the candidate set of symbols S. Could you explain how this is supposed to work in practice, e.g., in the case where S can be a set of strings?

3/ I was surprised to see that k can go up to 512 in Section 4.4 but the size of S is taken to be 256 in the experiments described in Section 5.3. This seems like an extraordinarily small set size for evaluating systems with open alphabets. Is there a reason precluding the use of larger alphabets? Does it make sense to consider k's greater than the size of S in these experiments?

== Typos

[Line 428] loss [less] than or equal to
[Line 482 and the rest of the section] Missing boldface for vectors of probabilities
[Line 676] has possesses

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper studies several local privacy mechanisms for distribution estimation, and considers in which regimes these mechanisms are (approximately) optimal. For binary alphabets (where each individual's datum is simply a bit), they observe that randomized response is the optimal mechanism for arbitrary relationships of n and \epsilon; when the alphabet is k-ary, this is not necessarily the case. k-ary randomized response's error rate is suboptimal under l_2^2 loss for small \epsilon, but for large (~ log k) \epsilon, k-RR is optimal up to constants for the same loss function. k-RAPPOR, a local privacy-preserving mechanism, has the opposite property: for small \epsilon, the mechanism is optimal up to constant factors in terms of its l_2^2 error, and for \epsilon ~ log k, it is strictly suboptimal in terms of its l_2^2 error rate.
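For readers of this review who do not have the paper at hand, here is a minimal sketch of the two randomizers as described above. The function names are mine, and I am assuming the basic one-shot RAPPOR variant with per-bit flip probability 1/(e^(eps/2) + 1); this is not the paper's code.

    import math
    import random

    def k_rr(x, k, eps):
        # k-ary randomized response: report the true symbol with
        # probability e^eps / (e^eps + k - 1), otherwise report a
        # uniformly random *other* symbol.
        if random.random() < math.exp(eps) / (math.exp(eps) + k - 1):
            return x
        y = random.randrange(k - 1)
        return y if y < x else y + 1

    def k_rappor(x, k, eps):
        # Basic one-shot RAPPOR: one-hot encode x into k bits, then
        # flip each bit independently with prob. 1 / (e^(eps/2) + 1).
        p_flip = 1.0 / (math.exp(eps / 2.0) + 1.0)
        return [int((i == x) ^ (random.random() < p_flip)) for i in range(k)]

The contrast is visible directly: k-RR reports a single symbol, while k-RAPPOR reports a k-bit vector whose per-bit noise shrinks as eps grows, which is consistent with the regimes described above.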
They also study several practical aspects of using these two mechanisms, including which decoding procedures perform better under various assumptions about the true distributions, and how to deal with alphabets of unknown size (a problem that arises in practice). They find that, for skewed distributions and for a wide range of parameter values, a particular decoding method that they introduce outperforms several other natural choices (the ML decoder and the normalized empirical decoder). They show that an extension of RR to an unknown alphabet size meets or beats the performance of RAPPOR when extended to allow unknown alphabet sizes. Both methods use hashing to extend to this setting.

Clarity - Justification:

The paper is, on the whole, well-written and easy to follow. The organization is also well-considered.

Significance - Justification:

Presentation of the paper's primary contributions: the theoretical contributions of the paper are clean; they describe settings in which different locally private algorithms achieve (near-)optimal error rates for l_1 and l_2^2 loss over non-binary alphabets. They also extend a known result about the optimality of randomized response in the binary setting from applying to certain loss functions to arbitrary risk functions. I think the practical aspects of the paper are quite interesting, since both of the problems they study seem as though they would arise in practice. It's nice to see some empirical evidence of which of the proposed methods to deal with open alphabets and decoding perform better in different settings.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Section 4.3 has several places in which it could be improved for clarity. For one, I'd suggest sticking with either expected loss *or* sample size, and not switching back and forth between them; this makes it easier to flip back and forth and make direct comparisons. [This might actually be a useful suggestion throughout the paper.] It would also be useful to make direct comparisons parenthetically (e.g., "the sample size is X (while in the non-private setting it is Y / we have an upper bound guaranteeing the existence of Z / a lower bound showing this is approximately optimal)"). Secondly, it would be useful to state precisely which of your claims hold for l_1 versus l_2^2 loss (for example, the two parts under "performance under X"; I believe the answer is "both", but it took me a minute to verify that this is probably what you meant), since you say you will be describing both losses for both mechanisms in several regimes of parameters. Also, is there an analogous statement to Proposition 5 for l_1, or do you/we understand the relationship of the performance of k-RR versus RAPPOR under l_1 loss?

I also think Prop 1's upper bound will only be true for *some* Q, not all Q. One example is the constant mechanism; it is perfectly private and has terrible l_1 and l_2^2 error. Maybe this is just a matter of how the statement is written; perhaps it should not refer to Q at all. The displaymath does not refer to Q, just the best-possible risk attainable (which is correct).

It is somewhat confusing that Section 2.1 refers to Q both as a mechanism and as a distribution (unless it has been applied to inputs). The output of Q can be described as a distribution over distributions; the mechanism without input is not really either sort of distribution.

"Hidden spinner" should maybe be replaced with "flips a biased coin (that only they can see the result of)" or something analogous; the spinner phrase only describes RR if you already know what RR is.
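Something like the following (a hypothetical sketch of my suggested phrasing, not the paper's code) makes the coin-flip description concrete:

    import math
    import random

    def binary_rr(bit, eps):
        # Flip a biased coin that only the respondent sees: tell the
        # truth with probability e^eps / (e^eps + 1), otherwise lie.
        if random.random() < math.exp(eps) / (math.exp(eps) + 1):
            return bit
        return 1 - bit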
Having glanced at the appendix, I think it would be useful to mention before Prop 3 that (and maybe why) the uniform distribution is the worst-case distribution for these estimators; you state this after Props 3/4, but saying it earlier might lead the reader in the right direction.

4.3: "loss than or equal to" -> "less than or equal to"?

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper compares two algorithms for differentially private distribution estimation (k-Randomized Response and k-Rappor) and presents extensions for open alphabets and hashing/cohorts (O-Randomized Response and O-Rappor). The performance of the algorithms is analyzed in terms of sample complexity and also compared empirically by simulation.

Clarity - Justification:

The paper is written clearly. The claims are precise and mostly well justified.

Significance - Justification:

Differentially private estimation of discrete probabilities is a basic and important requirement in private data analysis. The proposed algorithms (O-RR/O-Rappor) seem to be practical solutions to the problem. The two main contributions of the paper are 1) a careful analysis of pre-existing methods (k-Randomized Response and k-Rappor) and 2) the presentation of extended algorithms (O-RR and O-Rappor) building on previous methods. Although these are well executed, the novelty of this paper relative to Erlingsson'14 and Kairouz'14 is not very clear (see detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The paper is different from Erlingsson'14 and Kairouz'14, but what is the core novelty? Is it the detailed analysis? Or is it the extensions with hashing, cohorts, and open alphabets? If O-RR and O-Rappor are the new contributions, their exposition in Sec 5 seems rather brief compared to the space spent discussing previous methods (Sec 4). No analysis of O-RR or O-Rappor is presented.

Fig. 2: Why is O-RR performing better than the others (for finite alphabets) when O-Rappor is only as good as k-Rappor? The result seems important and begs for an explanation.

Is the "low privacy region" (\epsilon ~ \log k) of practical importance? For example, with k = 512, a probability ratio of 512 is perhaps not very private. The arithmetic below makes this concrete.
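To spell out the last point (my own arithmetic, not a claim from the paper):

    import math

    k = 512
    eps = math.log(k)        # the "low privacy" regime eps ~ log k
    print(eps)               # ~6.24
    print(math.exp(eps))     # 512.0: a reported value can be up to 512
                             # times more likely under one input than
                             # under another

=====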