Paper ID: 329
Title: Variance Reduction for Faster Non-Convex Optimization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper describes an algorithm for finding a stationary point of a function f(x) = \sum_{i=1}^n f_i(x), where the f_i are smooth but may not be convex. It describes a variant of SVRG (a variance-reducing SGD algorithm) that converges to an \epsilon-good stationary point (i.e., a point where the norm of the gradient is at most \epsilon) in at most O(n^{2/3} / \epsilon) iterations. This is better than gradient descent, which requires O(n / \epsilon) iterations (where accessing each function f_i counts as one iteration).

Clarity - Justification:
Very clear. Most of the technical content of the paper is a fairly readable proof sketch.

Significance - Justification:
An n^{1/3} improvement over gradient descent is meaningful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- One thing that makes it a bit difficult to compare the results in this paper with typical results for online convex optimization is that, without further assumptions, a point with a small gradient need not be close to a stationary point.
- The original SVRG paper states that its results can be straightforwardly extended to prove convergence to a local minimum for locally convex functions. Upon reflection, I'm not sure I agree: how do we know that the iterates will always remain inside the set S on which the function is locally convex? Do the authors of this paper have an opinion on that claim? If so, it may be worth adding here.
- The experiments in this paper do not compare SVRG to gradient descent, which seems like an odd choice, since the paper sells itself as an improvement over gradient descent. Can the authors comment on why those experiments weren't done?
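For reference, the variance-reduced update described in the summary above can be sketched as follows. This is a minimal sketch assuming the standard SVRG estimator of Johnson and Zhang with a last-iterate snapshot; the step size eta, the epoch length m, and the helper grad_i(i, x) are illustrative choices, not the authors' exact algorithm.

import numpy as np

def svrg_nonconvex(grad_i, x0, n, eta=0.01, m=None, epochs=20, seed=0):
    """Sketch of an SVRG-style loop for f(x) = (1/n) * sum_i f_i(x)
    (the sum form in the review differs only by a constant scaling).
    grad_i(i, x) returns the gradient of the component f_i at x."""
    rng = np.random.default_rng(seed)
    m = n if m is None else m               # inner-loop length, roughly one pass of data
    x_snap = np.array(x0, dtype=float)
    for _ in range(epochs):
        # Full gradient at the snapshot: costs n component-gradient evaluations.
        full_grad = sum(grad_i(i, x_snap) for i in range(n)) / n
        x = x_snap.copy()
        for _ in range(m):
            i = int(rng.integers(n))
            # Variance-reduced estimator: unbiased for the full gradient at x,
            # with variance that shrinks as x approaches the snapshot x_snap.
            g = grad_i(i, x) - grad_i(i, x_snap) + full_grad
            x = x - eta * g
        x_snap = x                          # one common snapshot rule; the paper's may differ
    return x_snap

# Smoke test on a small least-squares instance (convex, used only to check the code runs).
rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 5)), rng.standard_normal(200)
x_hat = svrg_nonconvex(lambda i, x: A[i] * (A[i] @ x - b[i]), np.zeros(5), n=200)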
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers the problem of reaching an approximate stationary point in the context of non-convex optimization. Though the problem is basic, surprisingly few results are known about it. In particular, it is known that gradient descent converges in time 1/eps, while stochastic gradient descent converges in time 1/eps^2. The paper considers the SVRG algorithm recently suggested by Johnson and Zhang. The main result shows that it converges to an eps-approximate stationary point in n^{2/3}/eps iterations. This yields an algorithm that is faster than gradient descent (the comparison to SGD depends on the relation between eps and n). The proofs require several new ideas.

Clarity - Justification:
The paper is well written.

Significance - Justification:
The paper provides a solid optimization result.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See significance.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a theoretical result on the application of SVRG to non-convex problems, along with experiments comparing against two baseline methods on two important problems.

Clarity - Justification:
Overall the presentation of the paper is polished and contains very few typos. I think the experiments are reasonably representative of the method as well. The proofs in the main body of the paper are sufficiently clear, more so than is typical for conference-length papers. The presentation of the results needs to be improved as discussed below.

Significance - Justification:
It is difficult to judge the significance, as the paper makes a very heavy set of assumptions in the main body. These assumptions are lifted in the appendix; however, I have not checked the proof in the appendix. I'm also not certain of the correctness of the simplified proof in the main body either (see the detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- The proof looks OK up to Eq 4.6, but I can't see how the black grad term (the non-stochastic gradient norm squared) is canceled between equations 4.6 and 4.7. Could you clarify this in the paper?
- line 426: "literatures" is an odd phrasing, perhaps just "techniques" instead?
- line 430: I'm not sure about this characterisation of bounding the variance using just the function value; I think it's an oversimplification. SDCA, SAG & SAGA each do something more complex than that.
- line 455: "turns" should be "turn".
- line 608: Why is n on both sides of the equation here? It would be clearer to divide through by n. This also applies to line 131.
- line 318: Equation needs a full stop.
- In the experiments, did the measure of a "pass" include the additional computation of the full-data gradient by SVRG in each pass? This should be included for a fair comparison against SGD (see the pass-accounting sketch after the reviews).
- In general, the use of two very different plotting styles doesn't look professional. I would suggest reproducing the first two sets of plots using the style from the third set of plots. I know that redoing plots is time-consuming, so it's not absolutely necessary.
=====
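A note on the pass-counting question raised in Review #3: one common accounting charges every component-gradient evaluation as 1/n of a pass, so an SVRG epoch that recomputes the full gradient at the snapshot and then takes m inner steps (two component gradients each) costs n + 2m evaluations, roughly three passes when m = n, versus one pass for n steps of SGD. The sketch below illustrates that accounting; the counting convention and constants are assumptions for illustration, not the paper's actual protocol.

# Effective passes over the data, charging each component-gradient
# evaluation as 1/n of a pass. Assumes the SVRG inner loop evaluates two
# component gradients per step (at the current point and at the snapshot)
# and recomputes the full gradient once per epoch; these counting
# conventions are an assumption, not taken from the paper.
def effective_passes(n, epochs, steps_per_epoch, method):
    if method == "svrg":
        evals_per_epoch = n + 2 * steps_per_epoch   # snapshot gradient + inner loop
    elif method == "sgd":
        evals_per_epoch = steps_per_epoch           # one component gradient per step
    else:
        raise ValueError(f"unknown method: {method}")
    return epochs * evals_per_epoch / n

# Example: 20 epochs, n inner steps per epoch.
n = 10_000
print(effective_passes(n, 20, n, "svrg"))  # 60.0 -> about three data passes per SVRG epoch
print(effective_passes(n, 20, n, "sgd"))   # 20.0 -> one data pass per SGD epoch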