Paper ID: 238
Title: Efficient Private Empirical Risk Minimization for High-dimensional Learning

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers the differentially private empirical risk minimization (ERM) problem over generalized linear models in the high-dimensional setting. ERM has previously been studied under the notion of differential privacy. In this paper, the authors tackle the problem under the constraint that only the projected data are accessible. The proposed mechanism is simple but effective: they solve the ERM problem in the projected space and then recover the solution by solving a convex programming problem. They also show that, under certain conditions, nontrivial utility can be preserved. Their results also imply a similar solution to the traditional ERM problem.

Clarity - Justification:
The paper is well written.

Significance - Justification:
A solid paper, but the results have some limitations. See detailed comments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As summarized above, the paper studies differentially private ERM over generalized linear models in the high-dimensional setting, where only randomly projected data are accessible; the proposed mechanism solves the ERM problem in the projected space and then recovers the solution via a convex program.

Generalized linear models are widely used in machine learning, and the problem considered in this paper is very interesting. Overall the paper is solid, and the main contributions are the theoretical guarantees of utility for solving a problem in a low-dimensional subspace, which is usually computationally cheaper. However, there is still a limitation: for the regularized ERM problem, the first term in the bound of Thm 3.11 is dominated by the second for sufficiently large n (clearly, the second term is independent of n). Thus the result is only comparable with previous results when the problem contains no regularization term. So I wonder whether the bound in Thm 3.11 can be improved.

Minor comments:
- Lines 200-203: Is it possible to provide some insight into how the Gaussian width plays a role in such a compressed learning problem?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper is about solving high-dimensional differentially private ERM (with a low-complexity hypothesis set) in the dimension-reduced domain. The approach is to compress the high-dimensional covariates using random projection, solve the compressed problem in a differentially private manner, and then reconstruct the solution in the original space by solving a "basis pursuit"-like convex program. Optimization error (excess empirical risk) bounds are derived and stated with explicit dependence on the Gaussian width and diameter of the parameter space, which goes beyond worst-case analysis that only assumes a bounded L2 norm. The key technical components that make the analysis possible are Theorem 3.1 and Theorem 3.8, which are useful new tools that could be applied to a variety of other tasks.
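For concreteness, the following is a minimal sketch of this compress / DP-solve / lift pipeline. It is only an illustration under my own assumptions: the private step is a generic output-perturbation placeholder with an assumed sensitivity, not the mechanism analyzed in the paper, and cvxpy is used only as a stand-in for the "basis pursuit"-like reconstruction.

```python
# Minimal sketch of the compress -> DP-solve -> lift pipeline described above.
# The DP step is a placeholder (output perturbation with an assumed sensitivity),
# NOT the paper's mechanism.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, m = 200, 1000, 50           # samples, ambient dimension, projected dimension (m << d)

# Synthetic sparse regression data, purely for illustration.
theta_star = np.zeros(d)
theta_star[:5] = 1.0
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# 1) Compress covariates with a random Gaussian projection Phi (m x d).
Phi = rng.standard_normal((m, d)) / np.sqrt(m)
X_low = X @ Phi.T                 # each row x_i is replaced by Phi x_i

# 2) Solve ERM in the projected space and privatize the output.
#    Here: least squares plus Laplace output perturbation as a stand-in DP step.
theta_low, *_ = np.linalg.lstsq(X_low, y, rcond=None)
epsilon, sensitivity = 1.0, 0.1   # sensitivity is assumed, not derived here
theta_priv = theta_low + rng.laplace(scale=sensitivity / epsilon, size=m)

# 3) Lift back to the original space via an l1-minimization convex program:
#    minimize ||theta||_1 subject to Phi theta = theta_priv.
theta = cp.Variable(d)
cp.Problem(cp.Minimize(cp.norm1(theta)), [Phi @ theta == theta_priv]).solve()
theta_hat = theta.value
```

The simplifications are deliberate: the point is only the three-stage structure (project to R^m, solve privately in the projected space, lift back to R^d), which is the structure the paper's bounds address.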
The results imply the first agnostic bound for \epsilon-differentially private Lasso, i.e., L1-regularized (or 1-norm-ball-constrained) GLMs, with a logarithmic dependence on the dimension. In general, it is unclear how sharp the bounds in this paper are. The corresponding result for (\epsilon,\delta)-DP Lasso is suboptimal with respect to the lower bound given in Talwar (2015), which seems to suggest that the bound for \epsilon-DP is also suboptimal.

In summary, the paper presents some new (albeit most likely suboptimal) results on the problem of differentially private learning of a class of high-dimensional generalized linear functions. The contributions are clearly presented in the context of the existing literature, and the proofs are relatively straightforward and easy to follow. As a result, I vote for acceptance to ICML.

Clarity - Justification:
No comments on this. It is presented in a pretty clear way.

Significance - Justification:
The results are new, but seemingly suboptimal.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Detailed comments are below:

1. Why does the regularizer have the form r(||\theta||_p), with r a univariate convex function, rather than r taking \theta directly as input?

2. What is the point of having both the regularizer r and a compact space \Theta? The p-norm is in general not preserved by random projection, and it seems that this regularizer has no use whatsoever in the bounds in Proposition 3.7 and Theorem 3.11 (they do not go to 0!). Only when we take r = 0 or a constant function do we get meaningful bounds. So I feel that if there is no special reason to keep them, dropping these regularizers would significantly improve the readability of the paper.

3. Footnote 10 does not really apply to this case because it is unclear whether there exists a sparse \theta that obeys \Phi \theta = \vartheta^{priv}.

Possibly useful references for the authors:

1. The idea of random projection (plus added noise) for privacy purposes dates back to well before DP was invented, e.g., matrix masking of the form Y = AXB + C, where A, B, C are random matrices, proposed in:
- G. Duncan and R. Pearson. Enhancing access to microdata while protecting confidentiality: Prospects for the future. Statistical Science, 6(3):219–232, August 1991.
An information-theoretic privacy guarantee for this approach, applied to sparse linear regression, was established in:
- Zhou, Lafferty, Wasserman (NIPS'07, TIT'09), "Compressed and privacy sensitive sparse regression".
This is for a different notion of privacy, though.

2. The use of random projection to help achieve DP with improved rates has been studied before in quite a few instances. For example:
- Zhou, Ligett, and Wasserman (ISIT'09), "Differential privacy with compression", which uses random projection to improve private releases.
- Blocki et al. (FOCS'12), "Johnson-Lindenstrauss transform itself preserves DP", for releasing marginals.
- Wang, Wang, and Singh (ICML'15), "A deterministic analysis of noisy sparse subspace clustering for dimensionality-reduced data": random projection of the data points makes DP possible without harming utility in sparse subspace clustering.
In these works, there is no need to "lift" the parameter vector back to the original dimension. Perhaps the authors could highlight this step as the novelty in the algorithm.
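As a side note on the matrix-masking reference above, a toy sketch of the form Y = AXB + C is given below; the distributions chosen for A, B, and C are my own arbitrary choices, purely to illustrate the shape of the transformation, and carry no privacy guarantee by themselves.

```python
# Toy illustration of matrix masking, Y = A X B + C (Duncan & Pearson, 1991).
# The random matrices here are placeholders; real schemes choose them to meet
# a stated privacy/utility criterion.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20                                  # records x attributes
X = rng.standard_normal((n, d))                 # the confidential data matrix

A = rng.standard_normal((n, n)) / np.sqrt(n)    # row (record) mixing
B = rng.standard_normal((d, d)) / np.sqrt(d)    # column (attribute) mixing
C = rng.standard_normal((n, d))                 # additive noise

Y = A @ X @ B + C                               # only the masked matrix Y is released
```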
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper addresses compressed learning via random projections while maintaining a guarantee of differential privacy. The authors heavily leverage results from (Dirksen, 2014) on compressed learning and (Bassily et al., 2014) on differentially private learning. The paper proposes two algorithms for combining these two techniques, providing a relatively straightforward process for running an arbitrary differentially private algorithm in the projected subspace. The more substantive contribution of this paper is the analysis of ERM bounds while maintaining privacy guarantees. While this analysis again draws heavily on existing results, it is non-trivial and appears to be a unique contribution.

Clarity - Justification:
I found this paper to be clearly written. There was sufficient discussion of the background literature, including citations for more detailed material, and the new contributions were clearly described.

Significance - Justification:
The essence of the methods proposed in this paper is just a combination of existing compressed learning and differential privacy techniques. However, the authors' analysis of the ERM bounds in this setting is non-trivial, and this appears to be the first time such bounds have been studied in this combined setting, so the paper still makes a significant contribution to the literature.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
While it is obvious that working in a lower-dimensional subspace will lead to efficiency gains for the learning algorithm, it might be beneficial to have a brief discussion of the specifics, such as how the complexity of current differentially private learning algorithms depends on m (the dimensionality of the projected space), so that the reader can put the bounds established in this paper in the context of the efficiency gains.

=====