We thank the reviewers for their detailed reports. From the comments we realized that there is an important point which is perhaps not immediately apparent in the paper, namely the historical context for this work. We detail this below and we hope it will help convince the reviewers of the significance of our results.$
Historical context.
In the 70s to 90s, the field of optimization largely focused on second order methods, such as interior point, cutting plane, or Newton. Nowadays, especially in machine learning, the community has switched to first order methods, primarily because of the beliefs that methods from the 70s-90s are not applicable to large-scale problems because the cost of solving linear systems is prohibitively large. The key message of this paper is that this not the case, and in fact those older methods can be used with a cost of nk+k^3 for the k^th iteration, instead of the prohibitive (and naïve) n^3. Our paper (and especially our numerical experiments) can be viewed as an invitation to the machine learning community to rethink the applicability of second order methods.
Reviewer_5:
We have added a proof of Claim 1 in the online version of this paper. The proof is basically same as the standard gradient descent proof and cutting plane method proof. However please note that we deliberately did not include this proof in the first submission as this result is not the main point of the paper. While we agree with you that theoretical guarantees are important, we hope you will also agree that there is room for more applied work at ICML. Such papers should be evaluated on the new insights that they provide, as well as the experimental results. Finally, we believe that the fact that this work leads to many new open problems should be viewed as a positive point for the paper.
Reviewer_6:
Our implementation of BFGS uses exact line search and it is known that this is equivalent to CG. For example, see the first lemma in "A Relationship between the BFGS and Conjugate Gradient Algorithms and its Implications for New Algorithms". That is the reason we write that CG, BFGS, BFGS+ are "optimal". Also note that we have added a proof of Claim 1 in the online version of the paper.
Reviewer_2:
Thanks a lot for your comments, they will greatly help to improve the paper! Here are detailed answers to a subset of the remarks:
130: Yes, x is in R^m. Thanks.
205-207: That is true when fval <= f(x).
245: You’re right it is y_i.
250: In fact this is a simple one-dimensional problem which we solve by binary search (namely keep decreasing alpha by constant factor until it is non-empty).
254: Whole line (as indicated in the equation).
307: We only wrote down the formula so that one can implement our algorithm without redoing the calculation. We checked it by numerical differentiation, and also note that the fact that our algorithm converges gives an evidence this is correct.
321: Yes, D = diag(d).
338: This comes from an appropriate (and simple) grouping of the terms.
348: It is a typo, thank you.
396: We believe that the algorithm you refer to only stores a weighted sum of previous gradients. Therefore, that algorithm still takes O(n) space instead of O(nk) space.
440: The geometric oracle is a typo, it should be geometric politician.
491: It is not entirely obvious, but it is an easy extension of original proof by Nesterov.
532: The formula is correct. b_i are labels of the sample.
598: You are right. In fact we do this out of fairness to the other methods, indeed if we make the function truly non-smooth then all algorithms except ours will fail.