We thank the reviewers for the detailed feedback. The reviewers recognized the practical (Reviewer 2) and theoretical (Reviewer 3) significance of our contribution.$
Reviewer 1, Novelty concerns
We understand that since our method builds upon two recent developments (i.e., SVRG and a variant of Block Lanczos), our contribution might seem as a simple combination of these two methods. However, as the Reviewer 3 also remarks, our contributions are highly not trivial. Following the suggestions from Reviewer 3, we will make more clear in the final version that:
a) “the right choice of preconditioner (line 502) is not that obvious” (Reviewer 3)
b) “the way to use it with SVRG without computational overhead is also not trivial” (Reviewer 3)
c) “the complexity analysis has moderate difficulty (not trivial either)” (Reviewer 3)
d) It is crucial to use SVRG/SDCA/... with non-uniform weights.
e) It is not trivial to translate the guarantees of a fast SVD method into guarantees on the resulted condition number. In particular, this result requires strong guarantees on the relation between the singular values (while standard fast SVD methods only provide a bound either on the Frobenius or the spectral norm). As far as we know, it is the first time that this proof technique is used.
Reviewer 1, Missing reference
Thanks for the missing reference. We will certainly relate to it in the camera ready. While we use low-rank approximation in order to make the preconditioning process efficient, Yang et al. suggest using a random mini-batch. It seems that these two methods complement each other rather than proposing a similar approach.
Reviewer 1 and 3, Why SVRG only?
We could equally rely on SDCA/SAG/MISO/etc. The only reason for using SVRG was the relevant bound for non-uniform sampling. However, this bound can be derived for all the above algorithms. We will better explain this point and discuss the possible extensions to other algorithms.
Reviewer 2 and 3, Going beyond Rigde regression
Of course, it would be interesting to understand what other type of loss functions can be considered. For example, when minimizing self-concordant functions, the Newton step requires minimizing a quadratic form. Therefore, it is natural to ask if we can benefit from applying our method for approximating the Newton step. Also, from the empirical point of view, it would be interesting to characterize cases in which the gain resulted from the sketched conditioner is significant. We believe that our paper makes an important first step towards these goals.
Reviewer 2, How significant is the d \leq n requirement?
The requirement d \le n is not significant. We will remove this assumption and update the bounds accordingly. Thanks for pointing it out.
Reviewer 2, In equation (3), some concrete examples of the (perhaps asymptotic) behavior of the ratio might be helpful.
Indeed, we will add concrete examples of the behavior of the ratio.
Reviewer 2, Experiments details
Due to the space limit, the details of the experiments will be added to the Appendix of the paper.
Reviewer 3, Do we really need a proof for Theorem 4?
There are two definitions of average condition number: a) the trace of the Hessian divided by the last eigenvalue. b) The average smoothness of the components appearing in the objective divided by the overall strong convexity. Theorem 4 says that in our case, the definitions are equivalent. Since the proof follows almost immediately from the definitions, we put it in the Appendix.