Paper ID: 305
Title: A Self-Correcting Variable-Metric Algorithm for Stochastic Optimization

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a novel quasi-Newton method and presents convergence guarantees for the method as well as empirical results.

Clarity - Justification:
Very clear presentation.

Significance - Justification:
This work appears to be a significant advance in the area of quasi-Newton methods. One issue I find with this work, however, is the disconnect between the motivation in the introduction and the actual results. The work is motivated by large-scale machine learning optimization problems, but aside from Section 4.2, the entire paper (algorithm, theory, experiments) is really focused on a method targeting small-scale, single-machine problems.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Other common methods for large-scale optimization, such as mini-batch methods (e.g., [Jaggi/Smith/et al., NIPS 2014]), should be mentioned.
- The paper is motivated by large-scale ML problems, but the experimental results are all on small toy datasets in Matlab.
- Given that the goal of optimization methods is to minimize some objective, experimental results should be reported in terms of the value of the objective, not the test error of the learned function.
- References should be cleaned up and given a consistent format.
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This work proposes a new view on stochastic quasi-Newton methods, emphasizing that the self-correcting property is crucial to maintain in the stochastic case. The authors propose a new method which guarantees that the Hessian approximation stays well-conditioned (a limitation of previous approaches in practice, even though their theory assumes it). Some quite small experiments indicate that it may be advantageous over a small subset of previous methods.

Clarity - Justification:
The paper is very clearly written and easy to follow, given the topic.

Significance - Justification:
The theoretical analysis provides a unique new perspective on stochastic quasi-Newton methods, quite different from other recent approaches. I believe the paper gives a nice new perspective on these methods and could get this area unstuck and moving towards practical methods that would be broadly adopted. On the other hand, the experiments are very underwhelming. The work would be much more impactful if this issue were addressed (see detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The discussion of issues with stochastic quasi-Newton methods and a careful consideration of their convergence analysis is welcome, especially given the many recent papers that ignore crucial issues. The condition (9) is simple but I believe it addresses a key issue, and I appreciate that the authors are honest about the difficulty that (13) poses in practice. I did not have time to check the appendix due to the short review period, but I did not find any obvious problems in the main text.

While I really like the theoretical contribution, I wish that the authors had put more time into their experiments. This has three components:

1. The datasets are tiny. It is hard to take seriously an experiment where no competing method takes more than 2/3 of a second and the implementations are done in Matlab.
Matlab is terrible for implementing stochastic gradient methods due to the loop overhead, and I don't think we should really be concerned about problems that take less than a second to solve. The second and third experiments are larger, but are still smaller than the datasets in related papers. Note that the runtimes for RCV1 don't really make sense (they are far larger than expected); I wonder whether the authors are properly taking advantage of the sparsity in their implementation of the regular stochastic gradient method (a minimal sketch of what I mean appears at the end of my comments). Fixing this might erase the reported difference, since one could then do many more stochastic gradient iterations in the same time.

2. The experimental setup is fundamentally flawed: the experiments search over many possible parameter settings before deciding on the final values, and this tuning time is not taken into account. As presented, the proposed method is therefore not a practical method. While this criticism applies to stochastic methods in general, users do not want to search over a large number of parameters. Please consider committing to some practical choices (or propose a heuristic for choosing these parameters) rather than simply showing that hypothetically-good parameters exist.

3. The experiments do look promising, but there are a lot of competing stochastic quasi-Newton methods (and other related methods, such as Newton-like methods with subsampled Hessians or growing-batch strategies). Based on the experiments, I believe this is a promising approach, but they did not convince me that it will outperform the many methods discussed in the introduction. The paper would be more convincing if the authors compared against several competing approaches, and if they really believe the method works they should also release their code. (Some other authors have released their code, so these would be easy to compare to.)

Finally, does the limited-memory version of the algorithm actually have the self-correcting property? Please comment on this issue! (Right now it feels like a bait-and-switch to talk about BFGS's self-correcting property when in practice we will typically use L-BFGS.)
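To clarify the sparsity remark in point 1: a single stochastic gradient step for a linear model only needs to touch the nonzero coordinates of the sampled example. Below is a minimal sketch of such an update (written in Python/NumPy rather than Matlab, with a logistic loss and variable names of my own choosing; it is illustrative only and not the authors' implementation):

    import numpy as np
    from scipy.sparse import csr_matrix

    def sgd_epoch_sparse(X, y, w, step, reg=0.0):
        """One epoch of SGD for L2-regularized logistic loss on a CSR matrix X,
        touching only the nonzero coordinates of each sampled example.
        Illustrative sketch only, not the paper's implementation."""
        n, _ = X.shape
        for i in np.random.permutation(n):
            start, end = X.indptr[i], X.indptr[i + 1]
            idx = X.indices[start:end]          # nonzero feature indices of example i
            vals = X.data[start:end]            # corresponding feature values
            margin = y[i] * vals.dot(w[idx])    # sparse inner product y_i * x_i^T w
            g = -y[i] / (1.0 + np.exp(margin))  # scalar loss gradient factor
            if reg > 0.0:
                w *= 1.0 - step * reg           # dense shrinkage; a lazy-update scheme
                                                # would avoid even this O(d) work
            w[idx] -= step * g * vals           # update only the nonzero coordinates
        return w

    # Example usage on synthetic sparse data (a stand-in for RCV1-style inputs):
    # X = csr_matrix(np.random.binomial(1, 0.01, size=(1000, 5000)).astype(float))
    # y = np.random.choice([-1.0, 1.0], size=1000)
    # w = sgd_epoch_sparse(X, y, np.zeros(5000), step=0.1)

If the baseline instead forms a dense length-d gradient at every iteration, its per-iteration cost (and hence the reported RCV1 runtimes) could be inflated by a large factor.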
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a subtle modification of the BFGS update rule that makes it resistant to stochastic gradient noise (thanks to the self-correcting properties of the BFGS updates). This opens the way to second-order stochastic gradient methods. Theoretical considerations are complemented by empirical evaluations on small datasets.

Clarity - Justification:
The paper does a good job navigating the complexity of the theoretical discussion. The description of the experiments lacks a discussion of the mini-batch sizes and their impact on amortizing the cost of the additional calculations required by the proposed algorithm. This is a bit confusing because the graphs report training curves versus the number of iterations (not CPU time), and because the tables mention CPU time but do not say when these CPU times were measured (when the error reaches a threshold, after a fixed number of iterations, etc.).

Significance - Justification:
The contribution is novel and opens the way to second-order stochastic algorithms. The experiments are convincing, but the paper lacks a discussion of the CPU-time consequences of the algorithm. Such a discussion might be needed because the CPU-time gains reported in the tables seem small.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See above boxes. The only thing the paper lacks is a clean discussion of the CPU cost of the additional computations and of the impact of mini-batch sizes; these aspects are not very clear in the experiments either. A back-of-envelope sketch of the kind of accounting I have in mind is given below.
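As an illustration, here is a rough per-iteration flop count comparing a dense mini-batch gradient with a generic limited-memory quasi-Newton update (the 2bd and 4md cost estimates, the dimension, and the memory size are my own assumptions for illustration, not numbers taken from the paper):

    def cost_model(d, b, m):
        """Rough flop counts for one iteration: dense mini-batch gradient (~2bd)
        versus a generic L-BFGS-style two-loop recursion (~4md). Orders of
        magnitude only; the constants for the proposed update may differ."""
        grad = 2.0 * b * d      # mini-batch gradient over b examples, d features
        update = 4.0 * m * d    # limited-memory quasi-Newton direction
        return grad, update, update / grad

    for b in (1, 10, 100, 1000):
        grad, update, ratio = cost_model(d=50000, b=b, m=10)
        print(f"b={b:5d}: gradient ~{grad:.1e} flops, QN update ~{update:.1e} flops, "
              f"overhead ~{100 * ratio:.0f}% of the gradient cost")

The point is that the extra per-iteration work is only negligible when the mini-batch is large relative to the quasi-Newton memory, and for sparse data (where the gradient cost scales with the number of nonzeros rather than with d) the relative overhead is even larger. This is exactly the trade-off the experiments should make explicit, together with the CPU times.
=====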