Rev 2:

(1) Acceleration: To obtain a better rate, acceleration methods use extrapolation steps, while SDNA uses local curvature information. We are convinced that SDNA can be made "accelerated", which should lead to further improvement (left to future work). Thus, comparing our method with accelerated methods does not make sense: the tricks are orthogonal.

(2) Notation: Our notation is not "weird"; it is related to the standard pseudo-inverse. In fact, we considered the notation you suggest when writing the paper, but it leads to much more notation elsewhere. Note that your M_S is an |S| x |S| matrix (our M_S is an n x n matrix instead). Adopting your notation would force us to introduce a transformation from |S| x |S| to n x n in order to express objects such as E[M_S], (M_S)^{-1}, E[(M_S)^{-1}] (see (10), (11), Lemma 5, ...). These appear in many places in the paper; hence simple notation is needed.

(3) Similarity to other 2nd-order methods: While our method may seem rather natural, it is novel and was not analyzed before. SDNA is the first randomized 2nd-order method where the steps are performed in the dual of ERM. Things become much easier this way since, under L2 regularization, the dual objective is a quadratic plus a separable term! So it is easy to utilize 2nd-order information in the dual: the Hessian does not change; the difficulty is in its size. We prove that while standard 1st-order stochastic methods can only hope for a sublinear speedup in the minibatch size, our method always obtains a superlinear speedup (Cor 10). No such result was known.

(4) Experiments: The amount of speedup we get has nothing to do with the dimension of the problem; it has to do with "how much curvature lies in small-dimensional subspaces", which depends on the structure of the data. Example 7 can be lifted into any larger dimension and the same conclusions will hold. Having said that, the problems we solve are not that small: for the cov data, n = 522k. The experiments conclusively demonstrate our theoretical findings on standard datasets (not chosen to favor 2nd-order methods). We will add more experiments to the supplementary material.

(5) "Some of the conclusions ... are not strictly correct": Your criticism is not justified. We say that the iteration complexities (i.e., the worst-case bounds we prove) of the three methods are ordered in a certain way. That is precisely what Theorem 6 says. We do not say that this means the same behavior must be observed in practice. We study practical behavior through experiments.

Rev 4:

(1) "All curvature information": We made this clear right in the introduction! See lines 101-124.

(2) Lines 201, 262: Minor issues corrected.

(3) Line 411: We give a more formal definition in the sentence right after that one. We wanted this to be clear, so we said the same thing twice, in two different ways.

(4) Line 542: It is quite clear as stated: "...rate we can prove..." means "the best analysis known", since no other rate was established before.

(5) Line 717: We wanted to highlight in Sec 6 that there is some connection between SDNA and IHS in the case of ridge regression. The methods are different, though; Theorem 18 can be used to highlight this. We will make this more precise.

Rev 5:

(1) "Cost of each iteration": Lines 240-244 suggest that the cost of one step of Method 1 is O(|S_k|^3), while Eq. (9) suggests that the cost of one step of Method 3 is O(|S_k|). To this we need to add the cost of maintaining a "residual", which is then used to cheaply compute the partial derivatives of f for i in S_k. We will add a detailed explanation; a schematic sketch of the cost structure is given below. The role of Method 2 in the paper is to serve as a theoretical tool enabling us to establish the superlinear speedup of Method 1 (see Cor 10); there is no need to analyze its cost.
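To give the flavour of that explanation, here is a minimal NumPy sketch of where the costs come from. The variable names (A, alpha, r, S, lam, diag_M) are illustrative, the separable term of the dual is ignored, and the stepsize in the Method 3-like step is a conservative choice made purely for illustration; this is a sketch of the cost structure, not the exact updates analyzed in the paper.

```python
import numpy as np

# Schematic sketch only (illustrative names; the separable term of the dual is
# ignored, so this shows the cost structure rather than the exact updates).
# A is the d x n data matrix, alpha the dual iterate, and we maintain the
# residual r = A @ alpha so that partial derivatives of the smooth part of the
# dual, (1/(2*lam*n^2)) * ||A @ alpha||^2, are cheap to obtain for i in S_k.

def partial_grads(A, r, S, lam, n):
    # Partial derivatives of the smooth part for the coordinates in S,
    # computed from the maintained residual r = A @ alpha.
    return A[:, S].T @ r / (lam * n ** 2)

def method1_step(A, alpha, r, S, lam, n):
    # Method 1-like step: form and (pseudo-)solve an |S| x |S| system, which is
    # the O(|S_k|^3) cost referred to above, then update alpha on S and refresh r.
    g_S = partial_grads(A, r, S, lam, n)
    M_SS = A[:, S].T @ A[:, S] / (lam * n ** 2)
    delta = -np.linalg.lstsq(M_SS, g_S, rcond=None)[0]   # pseudo-inverse solve
    alpha[S] += delta
    r += A[:, S] @ delta        # keep r = A @ alpha in sync
    return alpha, r

def method3_step(A, alpha, r, S, lam, n, diag_M):
    # Method 3-like step: only a per-coordinate scaling on S, i.e. O(|S_k|) on
    # top of the gradient/residual work. diag_M holds the diagonal entries
    # M_ii = ||A[:, i]||^2 / (lam * n^2), assumed positive; the factor len(S)
    # is a conservative stepsize used purely for illustration.
    g_S = partial_grads(A, r, S, lam, n)
    delta = -g_S / (len(S) * diag_M[S])
    alpha[S] += delta
    r += A[:, S] @ delta
    return alpha, r
```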
(2) Experiments: There is a tradeoff between the iteration complexity (IC) and the cost of each iteration. We highlighted this in how we ordered our experiments: in Experiment 1 we essentially show that IC favors Method 1; in Experiment 2 we show that the cost of each iteration is higher (and grows with tau) for Method 1; and the combined effect is shown in Experiment 3 (lines 758-768). This should explain part of the question you are asking: things get better for Method 1 as tau increases, up to some point after which they get worse. This should be clear from the exposition. The pattern found in Figure 3 is universal across problems, but the value of the best tau varies from problem to problem. Sparsity does play a role. A deeper theoretical understanding of the optimal tau is a topic for future research. An illustrative toy comparison in the spirit of Experiment 3 is sketched at the end of this response.

(3) Advancing the area: We give the first stochastic 2nd-order method operating in the dual of ERM and the first stochastic method enjoying a superlinear speedup in tau; this is a big step forward. Our theoretical analysis is highly novel, and the method works well in practice. Normally, one only obtains an advantage in terms of local convergence for 2nd-order methods; we obtain an improvement in global complexity!
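The toy comparison referred to in point (2) above could look as follows. This is illustrative only: the data is synthetic, the parameter values are hypothetical, the dual is modeled as a strongly convex quadratic (an identity term stands in for the separable part), and the updates are the simplified ones sketched earlier, not the exact updates analyzed in the paper. It runs a Method 1-like block-Newton step against a Method 3-like diagonally scaled step for several values of tau, using the same number of coordinate updates for each setting, and reports the suboptimality reached and the elapsed time, which is one way to observe the IC versus per-iteration-cost tradeoff.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 2000, 200, 1e-3                 # hypothetical sizes and regularizer
A = rng.standard_normal((d, n))
M = A.T @ A / (lam * n ** 2) + np.eye(n)    # toy model of the dual Hessian
b = rng.standard_normal(n)
diag_M = np.diag(M)
x_star = np.linalg.solve(M, b)
f_star = 0.5 * x_star @ (M @ x_star) - b @ x_star

def run(tau, block_newton, epochs=20):
    # Same number of coordinate updates for every tau: roughly epochs * n in total.
    x = np.zeros(n)
    start = time.time()
    for _ in range(epochs * (n // tau)):
        S = rng.choice(n, size=tau, replace=False)
        g_S = M[S] @ x - b[S]                        # partial gradient on the block
        if block_newton:
            # Method 1-like: exact minimization over the block, O(tau^3) solve.
            x[S] -= np.linalg.solve(M[np.ix_(S, S)], g_S)
        else:
            # Method 3-like: conservative diagonal scaling, O(tau) after the gradient.
            x[S] -= g_S / (tau * diag_M[S])
    gap = 0.5 * x @ (M @ x) - b @ x - f_star         # suboptimality reached
    return gap, time.time() - start

for tau in (1, 8, 32, 128, 512):
    print(tau, run(tau, True), run(tau, False))
```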