Paper ID: 1049
Title: Generalization Properties and Implicit Regularization for Multiple Passes SGM

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper explores formally the notion that stopping SGD early is an effective regularization strategy. In particular, it characterizes a bias-variance tradeoff between the number of epochs taken, the magnitude of the step sizes taken, and a parameter (beta) of the problem (interpretable as the difficulty of achieving minimum risk with a low-norm parameter). For particular problem settings (e.g. smooth losses), assuming we know the parameter beta, the tradeoff can be optimized: given a step schedule, we obtain advice on how many epochs to run (and vice versa). The results are all given under (a) standard regularity assumptions (numbered 1 and 3) and (b) the assumption that the parameter beta exists (numbered 2).

Clarity - Justification:
Generally, the paper reads clearly, especially in the sections providing technical results. The writing in the introduction could be improved by having a succinct statement of the main goal and technical result, concrete and upfront.

Significance - Justification:
The conceptual view of SGM as a statistical estimator is appealing, and in line with other recent work in the community that revisits iterative procedures as estimators. Here the theory focuses on the many-pass setting (as opposed to the more commonly studied online or streaming view). In this sense, this is a step among others in formalizing/characterizing an existing common intuition, using familiar technical tools.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
It seems to me that one can practically always think of the parameter beta as 1 -- especially if the parameter set is closed. Perhaps I am misunderstanding, but if that's indeed the case, I would suggest emphasizing this. I would also suggest relating/comparing the generalization bounds in the corollaries (under, say, beta=1) to the currently known ones for SGM, at least in the single-pass setting, as is suggested by several of the corollaries.

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper provides an interesting perspective on the expected excess risk of stochastic gradient descent when running for more iterations than data points (SIGM, looping over the data) or precisely as many iterations as data points (SGM, which the existing theory supports), with different choices of step-sizes. The results herein broaden the scope of the results of Rosasco and Villa (2015) to handle general smooth and non-smooth losses. The results appear to be correct, are presented with important specializations that make them useful to the community, and the main message appears to be that one can get by with implicit regularization through early stopping to recover the same theoretical guarantees as single pass, while allowing for a less-specialized step-size which is agnostic of the unknown approximation-error parameter $\beta$ (provided one stops at a time related to $\beta$). Experimentally, SIGM appears to obtain poorer performance than SGM, so there may still be some gap in our theoretical understanding of SGM vs SIGM.

Clarity - Justification:
The paper was very clear.
Significance - Justification:
The results of this paper are ``incremental'' in the sense that they broaden the scope of the results of Rosasco and Villa (2015); however, the increment is large enough to be significant, as that paper only handled the squared loss whereas the current paper handles general (convex) losses, both smooth and non-smooth. Developing generalization bounds, rather than optimization bounds, for SGM in cases where it runs over the data more than once is ripe for study, as in practice this is precisely what happens. I think the experimental results can also help guide future research (as discussed in Detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I really do not have any major criticisms of this work. However, I am quite curious about the seemingly inferior experimental performance of SIGM when run with a decaying step-size as compared to SIGM run with a constant step-size (which the authors noted). If the authors have some theoretical insight into this, it would be useful to include in the paper. If not, this would be good to understand for the future.

Also, experimentally, SIGM appears to obtain poorer performance than SGM, so there may still be some gap in our theoretical understanding of SGM vs SIGM. If SIGM really cannot perform as well as SGM, this is unfortunate, as we then need to tune right from the beginning rather than relying on early stopping (which lets us defer the amount of regularization, ideally with some online monitoring process; a minimal illustrative sketch of such a scheme appears after these reviews).

Minor comments:
I think the approximation error $D(\lambda)$ was meant to be defined as
\begin{align*}
D(\lambda) := \mathcal{E}(w_\lambda) - \inf_{w \in \mathcal{F}} \mathcal{E}(w),
\end{align*}
where $w_\lambda := \arg \min_{w \in \mathcal{F}} \mathcal{E}(w) + \frac{\lambda}{2} \|w\|^2$. The definition above is supported by the authors' statement on lines 590-592 that ``Then the last term in the error decomposition corresponds to the approximation error.'' Also, in the math display on lines 573-574, the last inequality can be strengthened to equality.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper thoroughly examines the generalization performance of stochastic gradient descent algorithms within a statistical learning framework. The considered stochastic gradient descent algorithms allow multiple passes over the data, which is in fact an important and also frequently encountered setting, especially in training neural networks. The main results presented in this paper can be summarized roughly as follows:
1. Considering a general loss function, with properly chosen step-size and number of iterations, the authors show that the averaged iterate of the stochastic gradient descent algorithm is risk consistent in the sense of Theorem 1;
2. Considering the case when the loss function is smooth, the authors show that non-trivial convergence rates can be obtained for both the averaged and the last iterate, see Theorem 2 and Corollaries 2, 3, 4, 5. Various circumstances concerning the step-size of the algorithm and the number of passes over the data can be encompassed. For instance, the step-size can be either fixed or decaying, while the number of passes can be either one or multiple;
3. Considering the case when the loss function is non-smooth, the authors show that meaningful convergence rates can also be derived under mild assumptions, although the obtained convergence rates may not be as sharp as those for the above cases;
4. Noticing the trade-off role that the step-sizes and/or the number of passes play, the authors claim that they are in fact implicit regularizers, as indicated in the title of this paper.
According to my reading, the analysis presented in this paper is rigorous and sound. Theoretical results concerning the generalization performance of SGD under various circumstances are comprehensive and important.

Clarity - Justification:
This is a well-polished paper and is easy to follow. Parallel results are presented in a neat manner. Several typos and confusing phrases need to be corrected or reworded, but nothing fatal.

Significance - Justification:
This paper provides theoretical answers to the following questions:
1. What is the learning performance of SGD when it is equipped with a smooth loss, a fixed (or decaying) step-size, and only one pass over the data?
2. What is the learning performance of SGD when it is equipped with a smooth loss, a fixed (or decaying) step-size, and multiple passes over the data?
I notice that many of the theoretical results obtained in this paper are the first ones in the literature and indeed make a step forward towards understanding SGD from a learning viewpoint. I congratulate the authors on an excellent study.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Questions:
1. Line 14: I am confused by the claim “stability properties”, since I fail to find any theoretical results concerning the stability of the considered SGD algorithms;
2. Line 573: I'm not clear about the inequalities presented here. In my understanding, it seems that, without any restrictions, under the i.i.d. assumption on the instances, the empirical risk $\mathcal{E}_\mathbf{z}(f)$ defined in Section 2 is an unbiased estimator of its population version $\mathcal{E}(f)$. Am I missing something?
3. Line 637: I notice that the numerical experiments in this paper are all implemented with respect to the non-smooth hinge loss, so it might be better to at least mention the authors' observations (even in one sentence) for the smooth loss case in order to make the study complete.

Minor Comments:
4. Line 79: convergence results derived -> convergence results are derived;
5. Line 93: epochs -> passes. Comment: It might be better to use the terminology epoch or pass consistently. They refer to the same thing in this study, according to my reading. This comment also applies to the subsequent appearances of this terminology and will not be mentioned again;
6. Line 99: in least squares -> in the least squares case;
7. Line 103: cross validation -> cross-validation. This comment also applies to its subsequent appearances;
8. Line 110: analysis are given -> analysis, are given. Comment: missing a comma;
9. Lines 151-154: The sentence here is not that clear and it might be better to rephrase it. Additionally, the right parenthesis is superfluous;
10. Line 190: The two formulas are labeled but are never referred to in the paper. Check also other labeled formulas;
11. Line 200: The notation for expectation with the subscript $\mathbf{z}$, and $J$, is used without predefinition, although one may be able to infer it from the text;
12. Line 251: conditions -> Conditions;
13. Line 266: a priori -> \textit{a priori}. Check also the subsequent appearances;
14. Line 268: parameters choices -> parameter choices;
15. Line 293: quantities -> terminologies or concepts;
16. Lines 301-304: It is not obvious to me that “in this view … SGM can implicitly implement a form of Tikhonov regularization by controlling the step-size and the number of epochs” without referring to theoretical results after Remark 1. In my view, there is no indication that the two items play a trade-off role between “bias” and “variance” before Remark 1. Additionally, it might be better to use “and/or” instead of “and” in Line 303;
17. Line 321: missing a comma at the end of formula (10);
18. Line 331: combining -> from combining;
19. Line 385: bias variance -> bias-variance;
20. Line 432: act as -> acts as;
21. Line 435: duplicate phrase “in practice”;
22. Line 447: and is natural -> and it is natural;
23. Line 449: a comma is missing before the word “albeit”;
24. Line 451: It might not be formal enough to say “Instate Assumptions…”. Additionally, missing a comma before “and” (for the sake of style consistency);
25. Line 471: postponed -> postpone; in -> to;
26. Line 527: Here, the referred formula should be (3) instead of (1)?
27. Line 532: so called -> so-called;
28. Line 535: A superfluous preposition “on”;
29. Line 539: Also, the referred formula should be (3)?
30. Line 544: “The generalization properties of considering multiple passes” is not that clear;
31. Line 564: in the right hand side -> on the right-hand side;
32. Line 584: The statement “in a non-parametric setting, the existence of a …” does not seem clear enough;
33. Line 589: the unique minimizer -> as the unique minimizer;
34. Lines 600-602: The sentence “for the non-smooth case a standard…” might not be as clear as intended;
35. Line 625: better to say “sharper error bounds” instead of “sharper errors”;
36. Line 642: is also shown -> are also shown;
37. Line 705: illustrate -> illustrates;
38. Line 751: “benchmark algorithm LIBSVM” is not clear. Is it SVM via LIBSVM? Check also its subsequent appearances, e.g., Line 802;
39. Line 777: It might be better to capitalize the names of the datasets (or at least their initials);
40. Line 816: missing the volume information;
41. Line 818: The name of the book should be in italics (for the sake of style consistency) and the initials should be capitalized as well;
42. Line 826: Capitalize also the initials (N I P S);
43. Line 829: Incomplete reference;
44. Line 831: Incomplete reference;
45. Line 853: bayesian -> Bayesian;
46. Line 861: It is better to make the citation style of articles from NIPS consistent.
=====
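For readers less familiar with the setup discussed in these reviews, here is a minimal sketch of multi-pass SGM with early stopping chosen by monitoring held-out error, i.e. the "implicit regularization through early stopping" scheme the reviews refer to. The hinge loss, the constant step size eta, and the patience-based stopping rule below are illustrative assumptions, not the paper's exact prescriptions.

```python
import numpy as np

def hinge_grad(w, x, y):
    """Subgradient of the hinge loss max(0, 1 - y <w, x>) at w (labels y in {-1, +1})."""
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(x)

def multipass_sgm(X, y, X_val, y_val, eta=0.01, max_passes=50, patience=3):
    """Multi-pass SGM with a constant step size; early stopping via held-out error."""
    n, d = X.shape
    w = np.zeros(d)
    best_w, best_err, stall = w.copy(), np.inf, 0
    for epoch in range(max_passes):
        for i in np.random.permutation(n):        # one pass (epoch) over the data
            w -= eta * hinge_grad(w, X[i], y[i])
        val_err = np.mean(np.sign(X_val @ w) != y_val)
        if val_err < best_err:                    # keep the best iterate seen so far
            best_w, best_err, stall = w.copy(), val_err, 0
        else:
            stall += 1
            if stall >= patience:                 # early stopping = implicit regularization
                break
    return best_w
```

In this sketch the number of passes effectively plays the role of a regularization parameter: stopping earlier corresponds to stronger implicit regularization, while running longer reduces the optimization (bias) term, which is the trade-off discussed throughout the reviews.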