To all reviewers:

Our convergence rate in Thm 1 is many orders of magnitude better than that of the only other stochastic variable metric method for ERM with a linear convergence rate, namely the method of (Moritz et al., 2015). The constants in the rate of (Moritz et al., 2015) become exponentially small as the dimension d grows (see Lemma 4 therein), whereas our rate does not depend on d at all. This significant improvement is due to our analysis of the spectrum of the block BFGS metric in Lemma 1. In particular, the theory of Moritz et al. requires a stepsize that is below machine precision except for trivially sized problems. To illustrate, consider the reasonable regime where M = 3, lambda = 10^(-1), Lambda = 1 and d = 10^3. The theory of Moritz et al. would require the stepsize to be below 10^(-6000); in this regime, our stepsize only needs to be below 10^(-6). We have added a more detailed comparison to the paper.

While our rate is orders of magnitude better than the best known linear rate for stochastic variable metric methods, it is not better than that of state-of-the-art first order methods such as SDCA or SVRG, which are much simpler methods and hence much easier to analyze. An explicit rate comparison with these methods is therefore not meaningful. Still, i) our method substantially outperforms the state of the art empirically, which is what ultimately matters, and ii) we obtain the best known rate for stochastic variable metric methods.

Since submitting, we have also completely redesigned and optimized the experiments. Our code is now significantly better, and we have added several new insightful experiments to the supplementary material. We have also addressed all of the reviewers' suggestions in the paper and hence believe no issues remain.

Reviewer 1:

Addressing the questions in the "Clarity" section:

1) The difference of gradients \nabla f_{S_t}(x_t) - \nabla f_{S_t}(x_{t-1}) is an approximation of \nabla^2 f_{S_t}(x_t) d_t, that is, of the action of the subsampled Hessian along the direction d_t = x_t - x_{t-1} (see the displays after point 4 below). In the classical setting of quasi-Newton methods, this gradient difference is used to avoid computing the directional derivative of the Hessian. In our stochastic setting, computing this difference would require the additional subsampled gradient \nabla f_{S_t}(x_{t-1}) at every iteration t. Using reverse AD (or simply hand coding), this additional subsampled gradient costs the same as a Hessian directional derivative, so computing it would defeat the purpose of saving on computation.

2) There is a typo on line 578 regarding the expectation E[x_t]. Line 578 should read "where we used that w_k = x_0 and \sum_{t=1}^m E[x_t] = m E[w_{k+1}]." This identity holds because of option II on line 373, where w_{k+1} = x_i and i is chosen from {1, ..., m} uniformly at random (see the displays after point 4 below). We have now made this clearer, and we thank you for asking.

3) We comment on and compare the efficacy of the three sketching methods empirically in Section 5, where we conclude that the prev sketching method is overall the most robust among all methods (including SVRG and MNJ).

4) Please read "To all reviewers" above.
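For completeness, here are the two identities behind points 1) and 2) written out (these displays are our own summary, using the paper's notation; they rely only on a first-order Taylor expansion and the tower rule):

\nabla f_{S_t}(x_t) - \nabla f_{S_t}(x_{t-1}) = \nabla^2 f_{S_t}(x_{t-1}) d_t + O(\|d_t\|^2) \approx \nabla^2 f_{S_t}(x_t) d_t,   where d_t = x_t - x_{t-1},

which is why the gradient difference is a surrogate for the Hessian action; and since w_{k+1} = x_i with i drawn uniformly from {1, ..., m},

E[w_{k+1} | x_1, ..., x_m] = (1/m) \sum_{t=1}^m x_t,   hence   \sum_{t=1}^m E[x_t] = m E[w_{k+1}].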
Reviewer 8:

Addressing the questions in "Detailed comments":

1) Our stochastic block BFGS method IS new. In fact, even (non-stochastic) BFGS had not previously been studied in the block setup, due to certain technical difficulties which we managed to overcome. The way we perform stochastic steps is also very different from previous works: we look at the action of the (subsampled) Hessian on a random matrix. This is a key and novel aspect of the method. It is closely related to a recently proposed breakthrough in stochastic matrix inversion, and can be seen as an adaptation thereof to the optimization/ERM setting. Furthermore, while we embedded our stochastic block BFGS method within SVRG, it can be embedded within other ERM solvers. Essentially, we propose a new stochastic technique which cheaply obtains a much better estimate of local curvature, and which can be embedded in many first-order methods, such as SGD, SAG and SDCA. When applied to a quadratic function, the matrix produced by our stochastic block BFGS update converges linearly to the inverse of the Hessian (see the numerical sketch at the end of this response); our method is therefore a reduced-variance method for estimating the inverse Hessian. As such, our work should be widely applicable.

2) Please see "To all reviewers".

3) We have added a detailed calculation to the paper. The iteration complexity of our method is obtained by taking logarithms in the bound on line 470 of Thm 1 and then isolating the counter k via standard arguments (a generic version of this calculation is written out at the end of this response). The overall complexity of our method is then the iteration complexity multiplied by the cost of a single iteration.
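To make the last claim in 1) concrete, below is a minimal numerical sketch (our own illustration, not code from the paper): it uses a generic block BFGS-type update H <- Delta + (I - Delta A) H (I - A Delta) with Delta = D (D^T A D)^{-1} D^T, driven only by the action A D of a fixed Hessian A on a random sketch D; the exact update and sketching strategies used in the paper may differ in details.

import numpy as np

# Illustrative sketch: on a quadratic (fixed SPD Hessian A), a block BFGS-type
# update driven by Hessian-sketch products A @ D drives H towards A^{-1}.
rng = np.random.default_rng(0)
n, q = 50, 5                              # dimension and sketch size (arbitrary)
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)               # SPD Hessian of a synthetic quadratic
H, I = np.eye(n), np.eye(n)               # initial inverse-Hessian estimate
A_inv = np.linalg.inv(A)
for t in range(200):
    D = rng.standard_normal((n, q))       # random sketching matrix
    AD = A @ D                            # action of the Hessian on the sketch
    Delta = D @ np.linalg.solve(D.T @ AD, D.T)       # D (D^T A D)^{-1} D^T
    H = Delta + (I - Delta @ A) @ H @ (I - A @ Delta)
    if t % 50 == 0:
        print(t, np.linalg.norm(H - A_inv))          # error shrinks geometrically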
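For 3), the calculation has the standard form (written generically here, since we do not reproduce the exact constants of line 470 in this response): if Thm 1 yields

E[f(w_k) - f(w_*)] \le \rho^k (f(w_0) - f(w_*)),   with \rho \in (0, 1),

then taking logarithms shows that

k \ge \log(1/\epsilon) / \log(1/\rho)

iterations suffice to bring the expected suboptimality below \epsilon (f(w_0) - f(w_*)), and the overall complexity is this k multiplied by the cost of a single iteration.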