First, we would like to thank the anonymous reviewers for their valuable judgements on this work. They will help us improve the paper regardless of the acceptance/rejection decision.

The reviewers have noted that the proposed method NIM is limited to small-dimensional problems and hence is not very practical. This is true. However, our main concern in this work is the possibility in principle of a superlinearly convergent incremental optimization method. NIM can be regarded as a generalization of the classical Newton method to the optimization of finite sums, so NIM and classical Newton share similar advantages and limitations, including the restriction to small-dimensional problems. We believe that our result on the possibility of superlinear convergence will stimulate research on Hessian-free and other efficient incremental methods. In addition, some machine learning problems do have a small number of variables, e.g. tuning the parameters of an RBF covariance function in Gaussian process models.

The reviewers have also noted that the current proof of NIM's local superlinear convergence is too technical and thus hard to follow and verify. We think this is because we consider the most general situation (with proximal operators and an inexact solution of the subproblem). When there is no non-smooth component (i.e. h = 0), the proof is essentially the same but much simpler. Therefore, to make the proof easier to follow, we plan to treat this simple case separately, before the general case, in the supplementary material (a schematic sketch of this smooth case is given at the end of this response).

Concerning the cyclic order of component selection and its comparison with random sampling and sampling without replacement (permuted order): our experiments show that both the cyclic and the permuted order lead to superlinear convergence, with slightly better performance of the cyclic order. The behavior of the random order is much worse and rather erratic, sometimes with only linear convergence. One argument for this is the (1 - 1/n) lower bound and the worst-case example from the FINITO paper. This bound still holds for NIM with random order, since on that worst-case example second-order information does not play any significant role. With the cyclic or the permuted order, however, the worst-case example is minimized exactly by NIM after the first epoch.

Our global convergence rate for NIM with cyclic order is indeed very slow. However, in our view this result should be seen mostly as a qualitative one (the method is globally convergent with a linear rate). In practice NIM converges successfully with much larger step sizes (for example, the unit step size is uniformly appropriate for logistic regression). The reasons for the small step size in the theorem are the following. The standard global convergence analysis of the classical Newton method (in terms of mu and L) yields a step size of order mu/L, compared with 1/L for the classical gradient method (see e.g. [Boyd & Vandenberghe, 2004]). The problem is that this analysis does not use the fact that the matrix in the method is the Hessian; it only assumes some matrix whose eigenvalues are bounded between mu and L.
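To illustrate where the mu/L order comes from, here is a schematic version of such an eigenvalue-only argument (the notation \(B\) and \(\alpha\) is ours; this is not the exact derivation from the paper or from [Boyd & Vandenberghe, 2004]). For a step \(x_+ = x - \alpha B^{-1}\nabla f(x)\) with \(\mu I \preceq B \preceq L I\) and an \(L\)-smooth \(f\), the descent lemma gives

\[
f(x_+) \le f(x) - \alpha\,\nabla f(x)^\top B^{-1}\nabla f(x) + \frac{L\alpha^2}{2}\,\|B^{-1}\nabla f(x)\|^2
\le f(x) - \alpha\Bigl(1 - \frac{L\alpha}{2\mu}\Bigr)\nabla f(x)^\top B^{-1}\nabla f(x),
\]

since \(\|B^{-1}\nabla f(x)\|^2 \le \frac{1}{\mu}\,\nabla f(x)^\top B^{-1}\nabla f(x)\). The right-hand side is minimized at \(\alpha = \mu/L\), whereas for \(B = I\) the same bound yields the usual step \(\alpha = 1/L\) of the gradient method. Nothing in this argument distinguishes the Hessian from any other matrix with the same eigenvalue bounds.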
Once the true Hessian is taken into account (as in the cubic regularization of [Nesterov & Polyak, 2006]), Newton's method becomes more effective than the gradient method even in the first stage of optimization, far from the solution. Our analysis of NIM's global convergence is based on the IAG analysis of [Gurbuzbalaban et al., 2015], and it likewise uses only a scaling matrix with bounded eigenvalues instead of the true averaged Hessian estimate. It is therefore not surprising that we obtain a step size several orders of magnitude smaller than the one for IAG, which in turn is smaller than the step size for SAG. Hence, the small step size in the global convergence theorem is mostly an issue of the current theoretical analysis rather than a reflection of the true behavior of NIM.

On Reviewer 1's suggestion to use proximal SAG instead of FGM for the inexact scaled proximal mapping, we can say the following. In our logistic regression experiments the average number of FGM iterations per NIM iteration lies in the range [1.25, 2], so there is no real need for a faster subproblem solver. Moreover, the stopping criterion for the inexact solution used in the proofs assumes that the inner method can compute the full gradient of the auxiliary function, which is not the case for SAG. In fact, the bottleneck in NIM's running time is mostly the number of outer NIM iterations; it can be reduced by using mini-batches of appropriate size (as is done in the SFO "sweet spot"). We plan to add a "sweet spot" version of NIM to our experiments with L1-regularization.

Concerning the experiments, we should note that for SFO we actually use the Python code from its authors. We also plan to include results on a larger number of datasets and with the SAG code from its authors. We can already say, however, that these results are substantially similar to the ones presented in the paper.
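As mentioned above, here is a minimal Python sketch of how a NIM-style update looks in the smooth case (h = 0) with cyclic component selection. It is an illustration in our own notation and bookkeeping, not the exact pseudocode from the paper: the component oracles grads[i]/hessians[i], the unit step size, and the explicit solve of the aggregated model are assumptions made for clarity.

import numpy as np

def nim_smooth_sketch(grads, hessians, x0, n_epochs=10):
    """Illustrative NIM-style loop for minimizing (1/n) * sum_i f_i(x), h = 0.

    grads[i](x) and hessians[i](x) are assumed to return the gradient and
    Hessian of the i-th component f_i at x. Components are visited in cyclic
    order, and each step minimizes the aggregated quadratic model exactly.
    """
    n = len(grads)
    z = [x0.copy() for _ in range(n)]            # stored points z_i
    g = [grads[i](x0) for i in range(n)]         # stored gradients of f_i at z_i
    H = [hessians[i](x0) for i in range(n)]      # stored Hessians of f_i at z_i
    H_bar = sum(H) / n                           # (1/n) * sum_i H_i
    b_bar = sum(Hi @ zi - gi for Hi, zi, gi in zip(H, z, g)) / n
    x = x0.copy()
    for _ in range(n_epochs):
        for i in range(n):                       # cyclic order of components
            x = np.linalg.solve(H_bar, b_bar)    # minimizer of the quadratic model
            new_g, new_H = grads[i](x), hessians[i](x)
            # refresh the i-th component and the running aggregates
            H_bar += (new_H - H[i]) / n
            b_bar += ((new_H @ x - new_g) - (H[i] @ z[i] - g[i])) / n
            z[i], g[i], H[i] = x.copy(), new_g, new_H
    return x

The point of the sketch is that all second-order information enters only through the running aggregates H_bar and b_bar, so one outer iteration costs one component refresh plus the solution of a small linear system, which is why the number of outer iterations (rather than the inner subproblem) dominates the running time in our experiments.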