Paper ID: 153
Title: Stochastic Variance Reduction for Nonconvex Optimization
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper investigates the convergence properties of stochastic gradient methods with a variance reduction technique in the non-convex setting. Faster rates than ordinary SGD, measured in terms of gradient magnitude, are established both for smooth non-convex problems and for gradient-dominated problems. The experimental results echo the theoretical analysis of the general non-convex case.
Clarity - Justification: The paper is well-organized and easy to follow. It made for enjoyable reading.
Significance - Justification: This paper studies a hot, important and interesting problem in optimization, and presents a novel and meaningful analysis. The reviewer also appreciates the comprehensive discussion on applying SVRG to various settings.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Pros:
1. This paper studies a hot, important and interesting problem in optimization, and presents a novel and meaningful analysis.
2. The reviewer appreciates the comprehensive discussion on applying SVRG to various settings.
3. The paper is well-organized and easy to follow, and made for enjoyable reading.
Cons:
1. Although faster rates are established, it is still not clear whether SVRG outperforms SGD within a reasonable number of iterations (e.g., T a few times larger than n), as several complicated constants (rules) appear in the complexity bounds. I suggest the authors show a simple example of \mu, \nu, \gamma, etc.
2. The global linear convergence rate to the optimal solution rests on the assumption that the function is gradient-dominated. However, the authors do not illustrate any machine learning problem that satisfies this assumption.
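For context on Con 2, the gradient-dominated condition in its standard form (the reviewer's sketch; the paper's exact constant naming may differ) reads:

```latex
% Polyak--Lojasiewicz / tau-gradient-dominated condition (standard form):
f(x) - f(x^\ast) \;\le\; \tau \,\|\nabla f(x)\|^2 \quad \text{for all } x,
% where x^\ast is a global minimizer. Strongly convex f satisfies this
% with \tau = 1/(2\mu), but the condition also holds for some nonconvex
% functions, e.g., least squares \|Ax - b\|^2 even with rank-deficient A.
```

Spelling out such a concrete machine learning instance in the paper would address this concern.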
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The authors extend convergence guarantees for SVRG to the non-convex and gradient-dominated regimes (the former converging not to a global optimum, but to a solution with small gradient).
Clarity - Justification: Nicely written, with clear notation.
Significance - Justification: It isn't at all surprising that the benefits of SVRG over SGD should carry over to non-convex objectives, so in a sense this paper doesn't tell us much that we don't already know. With this said, being able to say "this is obviously true" is no substitute for "here's a proof", and non-convex objectives are widespread enough (particularly in neural networks, as in the experiments section) that the results presented here are important.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The gradient-dominated result (Corollary 5), which describes a particular class of non-convex problems for which convergence to the global optimum is possible, is a nice addition. At the end of Section 4, you say "it is not clear if these strategies provide any theoretical gains for the general nonconvex case". Could you say more about this? At the extreme of m = 1, isn't SVRG just gradient descent (which has a linear n-dependence (Table 1) instead of n^{2/3})?
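To make the m = 1 question concrete, here is a minimal SVRG sketch (the reviewer's own illustration; function and variable names are ours, not the paper's):

```python
import numpy as np

def svrg(grad_i, n, x0, eta, m, epochs, seed=0):
    """Minimal SVRG sketch. grad_i(i, x) returns the gradient of the
    i-th component f_i at x; F = (1/n) sum_i f_i."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(epochs):
        x_snap = x.copy()
        # Full gradient at the snapshot (one pass over the data).
        full_grad = sum(grad_i(i, x_snap) for i in range(n)) / n
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced stochastic step.
            x = x - eta * (grad_i(i, x) - grad_i(i, x_snap) + full_grad)
    return x
```

With m = 1, the single inner step starts at x = x_snap, so the two stochastic terms cancel and the update is exactly x - eta * full_grad, i.e., plain full gradient descent, regardless of which index i is drawn.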
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper analyses SGD and SVRG in the nonconvex setting. Complexity results for convergence to stationary points are established. Convergence to the global optimum is obtained for nonconvex functions satisfying a certain inequality that also arises as a consequence of strong convexity. Mini-batch variants are also considered.
Clarity - Justification: The paper is mostly well written. However, many results do not mention all the assumptions, and some results are hard to understand since either not all constants are explicitly mentioned, or it is unclear how large they may be. The paper has many issues of this type; some of them are outlined below.
Significance - Justification: There is a very limited body of work on understanding the performance of training algorithms, such as SGD, SVRG, SDCA and so on, on nonconvex loss functions. This is a good paper in this category, with several interesting and new complexity results. Some results, however, are neither useful nor insightful (explained below).
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. The notation F_n for functions of the form (1) is used throughout, e.g., in the statements of many theorems. However, as defined, those statements would not be correct. A more precise definition is needed (e.g., functions of the form (1) with each f_i L-smooth). The set of such functions should therefore be described by two parameters: n and L.
2. References to the IFO framework for lower bounds should be removed, as it is known that the lower bounds therein are incorrect.
3. Table 1 does not include any information about the dependence of the bounds on L and other potentially dimension-dependent quantities.
4. Global linear convergence of (deterministic) proximal gradient methods under the Polyak-Lojasiewicz inequality (what you call tau-gradient-dominated f) was recently studied by Karimi and Schmidt. Hence, this particular insight is not novel. What does seem novel is its application to stochastic methods for minimizing finite sums.
5. Several results require sigma-bounded gradients (e.g., Thms 1 and 8). Which functions satisfy this condition? The square loss, for instance, does not.
6. The result of Thm 1 is not practical for several reasons. The constant c (which determines the stepsize) depends on the optimal function value (unknown) and on knowledge of sigma (also unknown). Hence, the method is not practical: one would need to search for stepsizes. This is already a major issue for vanilla SGD, and an even larger one for this method. Moreover, the result concerns the size of the best expected gradient over the course of the iterative process; the gradients can't be computed without hurting performance by a factor of n, nor can they be stored.
7. Theorem 1 does not list all assumptions needed to reach the conclusion. How is the output performed? Do you store all iterates and then choose x_a uniformly at random? Or do you draw this uniform variable prior to running the method and stop early? Or do you ignore the theoretical prescription and output the last point?
8. Theorem 2 and its corollaries do not list all necessary assumptions.
9. Theorem 3: mu_0 is chosen so that gamma_n > 0. However, gamma_n depends on m, and hence so (potentially) does mu_0. On the other hand, m is defined as a function of mu_0. This looks like circular reasoning; some explanation is needed in the main text.
10. Theorem 3: the quantity nu is not a universal constant but depends on L, which means the conclusion of Theorem 3 is weaker than presented. It is likely that several other constants claimed to be universal are not.
11. Cor 2: How exactly is this obtained from Thm 3? The short proof is not entirely correct, although the conclusion seems to be right (modulo the hidden dependence on L).
12. Cor 5 also does not mention the dependence on L; strangely, Cor 6 does. Cor 5 needs to be made more precise.
13. Convex case: A direct analysis of SVRG was done by Xiao and Zhang (A Proximal Stochastic Gradient Method with Progressive Variance Reduction) and by Konecny, Liu, Richtarik and Takac (Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting).
14. Line 556: An SVRG-type method for convex functions via perturbation was proposed earlier by Konecny et al (Semi-Stochastic Gradient Descent Methods).
15. Section 5: Mini-batch reduced-variance methods were studied before by Konecny, Liu, Richtarik and Takac (Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting).
16. Theorem 7: Assumptions on f are missing. The result is not useful, as the required mini-batch size is too large (at least n^{2/3}). As before, the result should also depend on L (since nu_b depends on it).
17. Theorem 8: I suspect L should appear in the second expression in the min.
18. Citations: Some cited papers have already appeared; adjust the bibliography accordingly. Sometimes you cite as NIPS, sometimes you spell NIPS out. Sometimes you write "arXiv preprint arXiv" and sometimes just "arXiv". There are many issues of this kind.
19. There are numerous typos and small issues in the text; please check throughout.
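On the sigma-bounded-gradient point, the failure for the square loss is easy to verify directly; a minimal sketch (the reviewer's own illustration, not taken from the paper):

```python
import numpy as np

# For the square-loss component f(x) = 0.5 * (a^T x - b)^2, the gradient
# is grad f(x) = a * (a^T x - b), whose norm grows linearly in ||x||.
a, b = np.array([1.0, 2.0]), 0.5
grad = lambda x: a * (a @ x - b)

# Evaluating along the ray x = t * a shows the gradient norm grows
# without bound, so no constant sigma satisfies ||grad f(x)|| <= sigma.
norms = [np.linalg.norm(grad(t * a)) for t in (1.0, 10.0, 100.0)]
```

Any theorem assuming sigma-bounded gradients therefore excludes least-squares objectives, which deserves an explicit remark.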
=====