Paper ID: 496
Title: Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers the stochastic variance-reduced gradient (SVRG) algorithm of Johnson and Zhang (2013) and some of its variants for minimizing a convex composite cost function F whose smooth term is the sum of several smooth functions f_i. The paper goes beyond the state of the art in two ways: (i) while previous results provide convergence guarantees for SVRG only when F is strongly convex, the paper provides a convergence rate for a variant of SVRG when F is not strongly convex (but all f_i are convex), which improves upon the performance guarantees of other algorithms for the same problem; (ii) for the case of not necessarily convex f_i (but still strongly convex F), a bound is provided for SVRG that depends on the non-convexity of the f_i (measured by their "lower smoothness"). Here the dependence improves upon recent results of Shalev-Shwartz (2015). A combination of the two results is also presented, yielding another SVRG variant that is valid when F is convex but not necessarily strongly convex and the f_i are not necessarily convex. Experiments are provided to demonstrate the behavior of the algorithms (and of other algorithms from the literature designed for the same problem, e.g., SDCA).

A very nicely written paper with strong theoretical results, but there are problems with the experiments.

Clarity - Justification:
The paper is nicely written and pleasant to read. To my knowledge, the related work is summarized adequately. Although I have not checked all the proofs, the ones I did check are correct, and the rest also seem plausible.

Significance - Justification:
The results are quite important, since they provide the best complexity results for many composite minimization problems occurring in machine learning (specifically, in penalized empirical risk minimization).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
While the theory part of the paper is very nice, the experiments are quite the contrary. Even in the theoretical part, the parameters of the algorithms are optimized in an oracle way that is impossible in practice, as they depend on quantities that are not available at the start of the algorithm (e.g., the suboptimality of the initial point). At this point I assumed that these bounds are only provided to indicate the potential capabilities of the algorithms (although it would be good to spell out the bounds for some more reasonable parameter settings). However, the experiments go one step further and present results with parameters tuned in hindsight. This has no practical meaning at all: in a similar way I could just tune the initial point and conclude that all algorithms converge in one step. Admittedly this is a slight exaggeration, since tuning the step size (and other parameters) has a less direct effect, but the results are still meaningless. Therefore, I would like to see experiments where the parameters are tuned in an honest way (selected based on prior information, cross-validation on a small part of the data/time, etc.); otherwise they just ruin this very nice paper. In addition, some parameter settings are not provided, and error bars are not shown (based on how rugged the graphs are, I am not sure that multiple runs were even performed for these randomized algorithms).
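For concreteness, a minimal Python sketch of the kind of honest tuning meant here (all names hypothetical; the svrg_run interface is an assumption for illustration, not the authors' code): select the step size on a small pilot portion of the data, then reuse that single choice for the reported runs instead of tuning in hindsight.

import numpy as np

def select_step_size_honestly(svrg_run, data, candidate_etas, pilot_frac=0.1, seed=0):
    """Choose a step size using only a small pilot subset of the data,
    then reuse that single choice for the reported full-data run.

    svrg_run(data, eta) is a hypothetical routine returning the final
    objective value reached by SVRG with step size eta on `data`.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    pilot_idx = rng.choice(n, size=max(1, int(pilot_frac * n)), replace=False)
    pilot = [data[i] for i in pilot_idx]

    # Evaluate every candidate step size on the pilot subset only.
    pilot_scores = {eta: svrg_run(pilot, eta) for eta in candidate_etas}
    best_eta = min(pilot_scores, key=pilot_scores.get)

    # The reported experiment then uses the pre-selected step size.
    return best_eta, svrg_run(data, best_eta)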
I would be happy (and wish) to raise my score to strong accept if the authors guarantee that they correct the experimental part. Since the authors also argue that the case of not necessarily convex f_i is important, it would be preferable to see a real example for this case, not just the synthetic ones presented.

Minor comments:
- l. 24-29: The last part of the abstract about PCA and DNN seems too speculative...
- l. 50: Give the name of SVRG.
- l. 186: Define gradient complexity and epsilon in the next line.
- l. 200: "O(s)" -- what is s?
- l. 280: missing "convex".
- l. 296: For clarity, consider recalling that f is convex.
- Around l. 310: Maybe mention importance sampling, which is another way to speed up SGD-type methods (although again orthogonal to the type of speedup provided by SVRG et al.); see, e.g., Afkanpour et al. (2013) http://jmlr.csail.mit.edu/proceedings/papers/v28/afkanpour13.html, Zhao and Zhang (2014) http://arxiv.org/abs/1401.2753, or Mohri and Yang (2015) http://arxiv.org/abs/1509.05760.
- l. 382: Since in the introduction smoothness was defined in terms of eigenvalues of the Hessian, it would be nice to mention the connection here. Also, note that l, L, sigma >= 0 should hold (otherwise lower smoothness and strong convexity become the same notion, with a different sign).
- l. 394: Argue/mention somewhere that the minimum exists.
- l. 408: Since equation (1.1) is hard to find (the label is not well placed), maybe write "equation (1.1)" instead of "(1.1)".
- l. 481: "the" epoch length.
- l. 491: What is m? Can you analyze the "auto_epoch" variant?
- l. 512-513/591: "when each f_i is non-convex" should be "when each f_i is not necessarily convex".
- l. 541: A conditional expectation is needed in equation (5.1).
- l. 548: "originally" should be "original".
- l. 574: Define the conditional expectation E_{i_t^s}.
- l. 626: "is" -> "are".
- l. 802: What is the distribution of a_i?
- l. 870: The claim for delta=0 is not backed up by Fig. 2a, as delta=0 is better than anything else. Also, to show the claimed linear dependence on the parameters, maybe provide plots normalized by these quantities (e.g., by lL).
- l. 1140: The inequality you use is Young's inequality (Cauchy-Schwarz + the arithmetic-geometric mean inequality).
- l. 1232: "1/7L" should be "1/(7L)".
- l. 1456: "Lemma A1" should be "Lemma B1".
- l. 1676: "Convexity" -> "Convex".

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper provides a new analysis of SVRG that improves the gradient complexity from (n + L)/\epsilon to (n + L/\epsilon)\log(1/\epsilon) for non-strongly convex functions. This improvement is considerable if the smoothness constant L is smaller than n. The authors also provide a tighter bound for sums of non-convex functions that depends on the lower smoothness and the upper smoothness of the functions, as opposed to the maximum of both.

Clarity - Justification:
The paper is extremely well organized; the authors give comprehensive citations of related work and accurately describe the differences between their approach and previous research.

Significance - Justification:
The bounds provided by the authors for non-strongly convex functions have been given in the past for algorithms that use an l2 regularizer (to provide strong convexity), which might or might not be wanted. The outcome of these algorithms is highly dependent on the scale of this regularizer, and they therefore become hard to tune in practice. The fact that SVRG can still achieve the same complexities by a simple modification is remarkable.
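For concreteness, the standard regularization reduction alluded to above (stated here from general knowledge, not taken from the paper) adds a small quadratic term to manufacture strong convexity and then trades its bias against the resulting condition number:

% Standard reduction: add a small quadratic to a convex but not strongly convex F.
\begin{align*}
  \tilde F_\sigma(x) &= F(x) + \tfrac{\sigma}{2}\,\lVert x - x_0\rVert^2
    && \text{($\sigma$-strongly convex proxy)} \\
  \tilde F_\sigma(x^\star) - F(x^\star) &\le \tfrac{\sigma}{2}\,\lVert x^\star - x_0\rVert^2
    && \text{(bias introduced by the regularizer)} \\
  \sigma &\approx \varepsilon / \lVert x^\star - x_0\rVert^2
    && \text{(choice that keeps the bias at the order of $\varepsilon$)}
\end{align*}
% The proxy's condition number is L/\sigma, so both the resulting complexity and
% the tuning of \sigma depend on \varepsilon and the unknown distance \lVert x^\star - x_0\rVert.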
Moreover, the analysis given by the authors is clean and easy to read, and the techniques used in the proofs will likely be used by other people in the community to provide better analyses of other algorithms. The bounds using the lower smoothness and upper smoothness of sums of non-convex functions seem to be the first of their kind. Furthermore, they provide a continuous transition from easy problems (sums of convex functions) to hard problems (Hessians with large negative eigenvalues).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall I enjoyed reading this paper. Here are some comments:

- Line 223: "Despite of" -> "In spite of".
- Line 332: The definition of \xi_t is wrong: \nabla_i f(x_t) - \nabla_i f(x_t) + \nabla f(\tilde{x}) should be \nabla_i f(x_t) - \nabla_i f(\tilde{x}) + \nabla f(\tilde{x}).
- On the experimental section: Could you explain what you mean by full passes over the data? Does this mean the number of times the full gradient had to be calculated? Do we know the Lipschitz constants for these experiments? It seems that once the number of iterations hits L/\epsilon the convergence should become exponentially faster, and this does not show, except maybe in Figure (e).
- In the Appendix, proof of Lemma A.1: equation (6) does not follow from Cauchy-Schwarz (at least not in one step), but from the duality of the function x -> x^2. (I do not see any other way of bringing 1 - \eta L to the denominator.)
- Line 1258: I don't think it is accurate to say that the sum telescopes. I think it would be better to say: "recursively applying the previous inequality we get...".
- Proof of Theorem 5.1: "Lemma A.1" -> "Lemma B.1". Also, I am having trouble deriving the first inequality. I don't see where the 3l/2 comes from; it should be 6\eta L l / (1 - \eta L), and without a specific choice of \eta this cannot be removed. Similarly, I don't know how to go from 3l/2 to \sigma/10 in the coefficient accompanying ||\tilde{x}^{s-1} - x^*||. I am almost sure that this can be easily fixed, but my score is conditioned on the authors explaining these steps.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper extends the recently developed stochastic optimization methods with variance reduction, and in particular SVRG, which was originally developed to solve smooth and strongly convex finite sums, to two other settings where (i) the objective is not strongly convex, or (ii) the individual functions in the finite sum are non-convex.

Clarity - Justification:
The paper is written quite well (in terms of technical content, exposition and organization), the claimed contributions are discussed in the light of existing results, and the paper surveys related work appropriately.

Significance - Justification:
The paper deals with a topic of current interest, and the results appear useful in solving a wide variety of optimization problems in machine learning.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper extends the recently developed stochastic optimization methods with variance reduction, and in particular SVRG, which was originally developed to solve smooth and strongly convex finite sums, to two other settings where (i) the objective is not strongly convex, or (ii) the individual functions in the finite sum are non-convex. Both settings are very important in machine learning and encompass many practical learning problems such as Lasso or PCA.
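As one concrete instance of setting (ii), written here for illustration from general knowledge of shift-and-invert approaches to PCA and not quoted from the paper, a strongly convex finite sum can be built out of individually non-convex summands:

% Shift-and-invert style least-squares step for PCA with data vectors a_1, ..., a_n:
\begin{align*}
  F(x) &= \frac{1}{n}\sum_{i=1}^{n} f_i(x), &
  f_i(x) &= \tfrac{1}{2}\, x^\top (\lambda I - a_i a_i^\top)\, x \;-\; b^\top x .
\end{align*}
% Each \nabla^2 f_i = \lambda I - a_i a_i^\top has smallest eigenvalue \lambda - \lVert a_i\rVert^2,
% which can be negative, so the individual f_i may be non-convex (their lower smoothness
% is on the order of \lVert a_i\rVert^2 - \lambda when this quantity is positive).
% The average Hessian is \lambda I - A with A = \tfrac{1}{n}\sum_i a_i a_i^\top, so F itself is
% (\lambda - \lambda_{\max}(A))-strongly convex whenever \lambda > \lambda_{\max}(A).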
The main contributions of this paper can be summarized as follows: (1) For non-strongly convex finite sums, a new variant of SVRG is proposed which attains an O(n \log 1/\epsilon + 1/\epsilon) gradient complexity (i.e., \log 1/\epsilon calls to the full gradient oracle and 1/\epsilon calls to the stochastic gradient oracle). (2) For finite sums of non-convex functions, a new analysis of SVRG is given that attains rates similar to those of standard SVRG under specific conditions on the condition number and on the number of functions in the finite sum.

On the positive side, the paper deals with a topic of current interest, and the results appear useful in solving a wide variety of optimization problems in machine learning. Moreover, the paper is written quite well (in terms of technical content, exposition and organization), the claimed contributions are discussed in the light of existing results, and the paper surveys related work appropriately. The proofs seem to hold, as far as I checked most of them.

On the negative side, although the proposals in this paper are valid, the overall contribution is limited, due to the following considerations:

(1) The algorithm in Section 4 for smooth and non-strongly convex objectives is similar to the MixedGrad algorithm proposed in [1] below. Both algorithms divide the total number of iterations into exponentially increasing epochs (\log 1/\epsilon of them), where at the beginning of each epoch a call is made to the full gradient oracle (resulting in n \log 1/\epsilon gradient computations in total), and inside each epoch s, 2^s calls are made to the stochastic oracle (1/\epsilon calls to the stochastic oracle in total); a schematic of this shared epoch structure is sketched after this review. Therefore, both algorithms are similar, both algorithmically and in gradient computation complexity. Moreover, as far as I checked, the proofs in the two papers are similar to some extent (to be fair, the analysis in [1] is more involved, as it considers constrained domains and aims to obtain high-probability bounds). Therefore, I think the algorithm SVRG++ is not sufficiently significant in terms of new contributions.

[1] Mahdavi, Mehrdad and Jin, Rong. MixedGrad: An O(1/T) convergence rate algorithm for stochastic smooth optimization. arXiv preprint arXiv:1307.7192, 2013.

(2) The analysis of SVRG in Section 5 for sums of non-convex functions also looks the same as the improved variance bound for SVRG (Lemma 11 in [2]), which is specialized to the PCA problem. I carefully checked the analysis of this paper and the one proposed in Lemma 11 of [2], and it seems the analysis here is not substantially different from, or better than, the one proposed in [2].

[2] Jin, Chi, Kakade, Sham M., Musco, Cameron, Netrapalli, Praneeth, and Sidford, Aaron. Robust Shift-and-Invert Preconditioning: Faster and More Sample Efficient Algorithms for Eigenvector Computation. arXiv preprint arXiv:1510.08896, 2015.

(3) On the third page of the paper, line 268, the authors say: "Garber and Hazan further applied SVRG to minimize this objective and proved an overall running time O((nd + d/\delta^2) \log 1/\epsilon). Our result improves this running time to O((nd + \lambda d/\delta^2) \log 1/\epsilon). Since \lambda may be as small as 1/n, this speedup is significant in theory." This argument is not correct, because \lambda should be a constant bounded away from zero, which is not small. Therefore, it is not clear to me how valid this statement is.

=====
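A minimal Python schematic of the shared epoch structure described in point (1) above; the oracle names, the update rule, and the snapshot choice are illustrative assumptions, not the authors' SVRG++ pseudocode or MixedGrad itself.

import numpy as np

def epoch_doubling_vr(grad_i, full_grad, n, x0, eta, num_epochs):
    """Schematic of the epoch structure from point (1): one full-gradient
    computation at the start of each epoch and 2^s variance-reduced steps
    inside epoch s, so num_epochs ~ log(1/eps) epochs cost about
    n*log(1/eps) component gradients for the snapshots plus
    sum_s 2^s ~ 1/eps inner steps.  grad_i(x, i) and full_grad(x) are
    hypothetical oracles for grad f_i(x) and grad F(x).
    """
    rng = np.random.default_rng(0)
    x_tilde = np.asarray(x0, dtype=float)
    m = 1                            # inner-loop length, doubled after every epoch
    snapshot_grads, inner_steps = 0, 0
    for _ in range(num_epochs):
        mu = full_grad(x_tilde)      # snapshot gradient at the epoch start
        snapshot_grads += n          # one full pass over the n summands
        x = x_tilde.copy()
        iterates = []
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced direction; the same index i is used in both
            # stochastic terms, so the correction is unbiased
            g = grad_i(x, i) - grad_i(x_tilde, i) + mu
            x = x - eta * g
            inner_steps += 1
            iterates.append(x.copy())
        x_tilde = np.mean(iterates, axis=0)   # illustrative snapshot update
        m *= 2
    return x_tilde, snapshot_grads, inner_steps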