Paper ID: 1170
Title: A Superlinearly-Convergent Proximal Newton-type Method for the Optimization of Finite Sums

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers minimizing the average of a finite number of convex functions, assuming that the overall average is strongly convex. A Newton-type Incremental Method (NIM) is proposed, which achieves a local superlinear convergence rate and a global linear convergence rate. Experiments show that NIM is very effective on problems with a large number of functions but a very small number of variables.

Clarity - Justification:
The presentation is very clear.

Significance - Justification:
The main result of this paper is the proposed NIM method, which is essentially a second-order version of SAG. That is, it uses the same lazy-averaging idea of SAG on both the component gradients and the component Hessians. I think the reason it has not been tried before is its clear limitation to problems with a very small number of variables. Nevertheless, the proof of the superlinear convergence rate, for a cyclic order of picking the component functions, is interesting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The idea of NIM, using a lazy update for the average Hessian, is straightforward, and the established local superlinear convergence result is interesting. However, the main drawback of NIM, as compared with SAG, is its limitation to very low-dimensional problems. In addition, for the global linear convergence, as the authors realized, the step size (mu/L)^3 seems to be too small. This also raises the question of what the convergence rate c in Theorem 2 is. At least some estimate of its order should be given; would it match SAG?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This work proposes a Newton-like stochastic optimization algorithm for minimizing finite sums. The surprising aspect of the work is that it achieves a superlinear local convergence rate with a unit step size. Numerical experiments on low-dimensional logistic regression indicate that it may be useful in practice.

Clarity - Justification:
The text is easy to read, although more time should be spent making the analysis easier for the reader to follow.

Significance - Justification:
If the superlinear convergence result is correct, this is a very significant result. There have been a large number of papers on stochastic Newton methods lately, and this is the only one that has shown a superlinear convergence rate (a result that I did not think was possible). If true, it significantly raises the bar on what we should expect out of a stochastic method and opens the door to a lot of new research on the topic.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
A minor issue is that you need some extra assumptions to guarantee that H is invertible. For example, you could assume that each f_i is strongly convex. It's surprising that better performance was obtained with cyclic selection, as other authors have pointed out that cyclic selection requires a much smaller step size. Indeed, the global convergence rate of the method is very slow. The authors should briefly comment on this. It should be pointed out that the method is a variant of the MISO method of Mairal. The paper should comment on the (1-1/n) lower bound for first-order methods in the FINITO work.
It would be good to say why second-order information allows us to improve over this lower bound. I'm really skeptical of the proximal-Newton variant as it is currently presented. If you use iterations of the FGM method to solve the sub-problem, then this destroys the low iteration cost of the stochastic method. This is reflected in Figure 1, but I'm surprised that the approach is even competitive with the other methods. Wouldn't the method be much more effective if it used proximal-SAG in the inner loop? This would make deciding when to stop more difficult, but this has been discussed in the SAG/MISO work. While SFO is cited, there have been quite a few Newton-like stochastic methods proposed lately. None of these achieve a superlinear rate, but please update the related work to discuss them. The experiments are far from extensive: only four datasets are used (with L1-regularized results not shown for two datasets), and the authors are using their own implementations. It's hard to draw conclusions from such a small number of experiments and "default" implementations of competing methods. The experiments would be more convincing if they were performed on a larger number of datasets and if they compared against the SAG/MISO/SFO codes released by those methods' authors. It's surprising that a unit step size works in the experiments, and I suspect that this is only because logistic regression is so nice. Unfortunately, I did not have time to check the appendix due to the short review period and the many papers to review. However, I will check it carefully, and if no errors are found I will strongly advocate for the acceptance of this paper.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an incremental version of Newton's method, dubbed NIM, which has a superlinear local convergence rate and a linear global convergence rate when used to minimize finite sums. NIM is also generalized to an inexact setting, which can be more efficient than its exact counterpart because it requires less computation per iteration.

Clarity - Justification:
The presentation of the paper was mostly clear. The claimed contributions are discussed in the light of existing results, and the paper does survey related work appropriately.

Significance - Justification:
This paper studies an important problem in machine learning/optimization and provides an incremental algorithm with a superlinear convergence rate.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

--- Summary of the paper ---
The paper considers strongly convex optimization problems whose objective is a finite sum of smooth convex functions plus, possibly, a non-differentiable term with an efficiently computable proximal mapping. The paper proposes an incremental version of Newton's method, dubbed NIM, which has a superlinear local convergence rate and a linear global convergence rate. NIM is also generalized to an inexact setting, which can be more efficient than its exact counterpart because it requires less computation per iteration. The proposed algorithm can be considered an extension of the SAG algorithm, but quadratic approximations of the component functions are used to incrementally update the gradient and the Hessian. Roughly speaking, the idea is to approximate each individual function with a quadratic model, apply a Newton-type update in every iteration, but refresh only a single component's model in each iteration.
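To make the mechanism concrete, here is a minimal sketch of the update as I understand it, for the smooth unregularized case with cyclic order and a unit step size; the function names (nim_smooth, grad_i, hess_i) and the plain NumPy bookkeeping are mine, not the authors', so this is only an illustration of the idea rather than their implementation:

    import numpy as np

    def nim_smooth(grad_i, hess_i, x0, n, n_epochs=10):
        """Sketch of the NIM idea: each f_i is represented by a quadratic model
        m_i(y) = f_i(z_i) + g_i^T (y - z_i) + 0.5 (y - z_i)^T H_i (y - z_i),
        and each iteration refreshes a single model and minimizes their sum.
        grad_i(i, x) / hess_i(i, x) return the gradient / Hessian of f_i at x;
        x0 is a 1-D starting point of dimension d.
        """
        x = x0.copy()
        z = np.tile(x0, (n, 1))                               # stored evaluation points
        g = np.array([grad_i(i, x0) for i in range(n)])       # stored gradients (as in SAG)
        H = np.array([hess_i(i, x0) for i in range(n)])       # stored Hessians
        # Running sums needed for the step: A = sum_i H_i, b = sum_i (H_i z_i - g_i).
        A = H.sum(axis=0)
        b = sum(H[i] @ z[i] - g[i] for i in range(n))
        for _ in range(n_epochs):
            for i in range(n):                                # cyclic selection
                # Remove component i's old contribution and rebuild its model at the current x.
                A -= H[i]
                b -= H[i] @ z[i] - g[i]
                z[i], g[i], H[i] = x.copy(), grad_i(i, x), hess_i(i, x)
                A += H[i]
                b += H[i] @ z[i] - g[i]
                # Newton-type step with unit step size: the minimizer of the sum of the
                # n stored quadratic models solves A x = b. Here we solve the d x d
                # system directly (O(d^3)); the paper maintains it more cheaply.
                x = np.linalg.solve(A, b)
        return x

With this bookkeeping each iteration touches a single component, which is exactly the lazy-averaging idea of SAG applied to both the gradients and the Hessians; as far as I can tell, the proximal variant replaces the linear solve above with the corresponding scaled proximal subproblem, which is what the inner FGM iterations are for.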
Despite its superlinear convergence rate, the algorithm requires O(nd + d^2) memory, where the first term is due to the fact that NIM needs to keep the gradient vector of every component function (similar to SAG), and the second term is due to the Hessian of the sum. The memory requirement can be reduced to O(n + d^2) when the algorithm is applied to linear prediction problems with an efficient implementation. In terms of computation, the algorithm needs the inverse of the Hessian matrix, but since only a single function is updated in each iteration (rank-one updates), the computational cost can be reduced from O(d^3) to O(d^2) thanks to the matrix inversion lemma (see the sketch after this review). The authors support their theoretical results with experiments on a few datasets, comparing the proposed algorithm to two first-order methods and a few other second-order methods. The datasets used are generally low-dimensional problems (the maximum number of features is 800, for the dna18m dataset), which makes them a good fit for NIM, as it suffers from memory and computational issues compared to first-order methods.

--- Quality, soundness and clarity ---
The presentation of the paper was mostly clear. The claimed contributions are discussed in the light of existing results, and the paper does survey related work appropriately. Regarding the quality of the writing, the paper is reasonably well written, and the structure and language are good. The paper is technically sound, and the proofs seem to be correct as far as I checked. However, the proofs of the convergence results are hard to parse in their current form. It would be better to give a sketch of the proof in a less technical manner, or to explicitly itemize its main steps.

--- Evaluation ---
Overall the paper is well written, and the proofs seem to hold (though I did not go over all of them thoroughly). Despite the memory and per-iteration computational cost issues I mentioned before, I think that the basic model of incremental second-order methods is interesting, so I lean towards acceptance (assuming the paper will honestly discuss the issues mentioned earlier).

=====
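Sketch referenced in Review #3: the O(d^2) claim can be made concrete for linear prediction problems, where each component Hessian has the rank-one form H_i = c_i a_i a_i^T (a_i is the i-th data point and c_i the second derivative of the loss at the current prediction). Refreshing component i then changes the Hessian sum A by delta * a_i a_i^T with delta = c_i_new - c_i_old, so the stored inverse can be updated with the Sherman-Morrison formula instead of being recomputed. The function name and the assumption that A^{-1} is stored explicitly are mine, not the authors':

    import numpy as np

    def refresh_inverse(A_inv, a_i, delta):
        """Sherman-Morrison update of the inverse Hessian sum after component i is
        refreshed in a linear model: (A + delta a a^T)^{-1}
            = A^{-1} - delta (A^{-1} a)(A^{-1} a)^T / (1 + delta a^T A^{-1} a).
        Costs O(d^2) per iteration instead of an O(d^3) re-inversion.
        """
        u = A_inv @ a_i                       # O(d^2) matrix-vector product
        denom = 1.0 + delta * (a_i @ u)
        return A_inv - (delta / denom) * np.outer(u, u)

For general components the swap of H_i is no longer rank one, so this trick is tied to the linear-prediction setting.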