Paper ID: 678
Title: Solving Ridge Regression using Sketched Preconditioned SVRG

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):            The paper combines a variant of SVRG with non-uniform sampling with           sketching techniques to obtain approximate solutions of linear           systems. The sketching matrix is used to form a preconditioner that           is adapted to SVRG. The method is applied to large-scale ridge regression           problems.  The main result is an algorithm with lower computational           complexity than the vanilla SVRG algorithm with probability 0.9.  The           faster the eigenvalues decay, the larger is the speed-up.           The theoretical results are complemented by experiments that confirm           the speed-up on various datasets.        

Clarity - Justification:            The paper is easy to read and the story is easy to follow and I            only have minor comments:            - There is a typo on line 433: the squared ell_2 norm           should apply to P^(-1/2)w instead of w.           - Do we really need a proof for Theorem 4? It looks obvious by           definition of the average condition number applied to the Hessian of           the quadratic objective function.        

Significance - Justification:            The first impression of the reviewer when reading the paper was           mitigated. At first, it seemed an obvious combination of recent fast           randomized SVDs with SVRG to solve a problem with limited scope           (ridge regression).            Yet, this negative first impression changed after reading the paper in details:            - the right choice of preconditioner (line 502) is not that obvious;            - the way to use it with SVRG without computational overhead is also not trivial;             - the complexity analysis has moderate difficulty (not trivial either).                     In fact, I think that the paper spends too much space presenting           introductory material such as SVRG and disucssing the role of           pre-conditioning, but not enough time explaining the difficulties           that naive combinations of other variance reduction methods such as            SDCA/SAG/MISO/SDPC/... would encounter when applying a simple change           of variable \tilde{w} = P^(-1/2)w, which would change the condition           number. In particular, why not using SDCA? As far as the reviewer           remembers (sorry for not providing references), there are variants with           non-uniform sampling that would also replace in the complexity           analysis the max of the Lipschitz constants of individual functions           by the mean.  Thus, can we apply the sketched preconditioning to           accelerate SDCA with non-uniform sampling? What makes SVRG           specifically compatible with pre-conditioning? Also, a small           discussion about why extending the result to more general loss           function is hard would be interesting.             Overall, I think think that the contribution is significant enough           for ICML, even though going beyond ridge regression would be a more           significant (but maybe irrealistic) achievement.        

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):            I have already detailed above why I believe that the contribution is           significant enough for ICML, even though my first impression was mitigated.           The presentation of the paper could however be improved to emphasize           a bit more why the contribution is not trivial (which was the first           impression of the reviewer when reading the paper).        

=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper proposes a preconditioner for the ridge regression problem to accelerate convergence for SVRG. The preconditioner is a truncated rank-k whitening matrix for the input data, which is obtained by a sketched truncated SVD introduced in Musco & Musco 2015. It provides an upper bound on the improved condition number as a function of the parameter k. In addition, it shows the average condition number improves significantly on several datasets.

Clarity - Justification: The paper is well written, clean and clear notations, good organization.

Significance - Justification: It seems the major contribution is using a truncated rank-k whitening matrix as a preconditioner. Compared with the full-rank version, such truncated version reduces computation from O(nd^2) to O(ndk), and still improve convergence when the eigenvalues of the covariance matrix decrease quickly.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): There is a closely related paper [1] that also reduces the condition number by preprocessing (whitening) the input data. Although in [1], they are still doing a full SVD instead of a truncated SVD. So, it seems the contribution of this paper is using a rank-k whitening matrix to reduce computation, and also bounding the condition number given the perturbations introduced in the randomized block Lanczos method.  The idea of rank-k whitening seems useful and in practice the data covariance matrices do have decay spectra as demonstrated in the experiments. It is also nice that the paper provides explicit relationship between the parameter k and the improved condition number.  The major concern is limited novelty. 1) The main algorithm, sketched truncated SVD, is already proposed in Musco & Musco, 2015. 2) The analysis is bounding the condition number up to a multiplicative factor of the exact version given the multiplicative error in Musco & Musco, 2015. Although the analysis is non-trivial, I feel it is not significant enough for ICML. 3) The main idea is using rank-k whitening matrix instead of the full whitening matrix. Again, although it is non-trivial, the novelty and significance seems insufficient.  Minor comments: It seems whitening the input matrix is not closely tied to SVRG, and can be used in conjunction with any other first-order methods. Maybe the paper does not need to be restricted to SVRG in their title.  [1] Yang T, Jin R, Zhu S, Lin Q. On Data Preconditioning for Regularized Loss Minimization. Machine Learning. 2014 Aug 13:1-23.

=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The authors propose a preconditioning method for ridge regression, which can be combined with SVRG to dramatically improve its computational performance in some settings.   Both theoretical and empirical results are given to support the authors claims.  The method works by reducing the average condition number of the input sample covariance matrix (plus the ridge regularization).  In the numerical experiments, the authors consider an actual regression example using simulated data; the rest of the experiments involve well-known real datasets for classification, which are adapted to fit the ridge regression set-up (i.e. "square loss serves as a surrogate for zero-one loss").  

Clarity - Justification: The paper is clearly written and I enjoyed reading it.  

Significance - Justification: Ridge regression is a workhorse method in machine learning and statistics.  The authors' preconditioning method appears to offer substantial improvements over current state-of-the-art methods for fitting ridge regression to big datasets (i.e. SVRG).  It seems likely that practitioners will find the proposed approach useful.   On the other hand, the proposed method does appears to be a fairly straightforward combination of existing procedures (SVRG and preconditioning via Musco & Musco (2015)), which may diminish its novelty.  

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper is clearly written and the contribution seems practically useful/significant.   I have a few comments/questions:   -- How significant is the d \leq n requirement?  -- In equation (3), some concrete examples of the (perhaps asymptotic) behavior of the ratio might be helpful.   -- In the numerical experiments, I'd like it if the authors could explain more clearly how squared-error loss was used in the classification examples.  How were the outcomes y_i coded?   -- The fact that all of the authors' real data experiments are actually classification examples seems unfortunate.  I would hope that the authors could find some actual regression examples to illustrate the method.  Perhaps from genetics/genomics/biological sciences.  -- Can the method be extended to other loss functions?  

=====