Paper ID: 573
Title: Train faster, generalize better: Stability of stochastic gradient descent
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper studies stochastic gradient descent (SGM) and uses the notion of algorithmic stability to analyze its generalization error. The paper makes several claims:
1. For convex problems, the stability (and hence the generalization error) scales as (L^2/n) \sum_t \alpha_t, where n is the number of examples, L is the Lipschitz constant, and the \alpha_t are the step sizes. Additionally, the paper obtains excess risk bounds by trading off the optimization error against the generalization error. The bound is at most a constant factor worse than classical analyses but applies when making multiple passes over the data (as is common in practice); other analyses of SGM do not apply for multiple passes.
2. For strongly convex problems, the stability is independent of the number of steps.
3. For non-convex problems, the generalization error grows sublinearly with the number of steps T and decays as 1/n (with an appropriate step size). In this direction, the paper shows that several popular heuristics improve the dependence on T, giving partial justification for these heuristics. No excess-risk bound is provided for the non-convex case (unsurprisingly).
The paper also experiments with SGM and neural nets on popular datasets. The experiments are not intended to obtain state-of-the-art performance but rather to assess the effects of training time.
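The stability notion behind claim 1 can be made concrete with a small numerical sketch (illustrative only, not one of the paper's experiments): run SGM with the same index sequence on two datasets that differ in a single example and track how far apart the two parameter sequences drift. The toy problem and all constants below are assumptions chosen for the demo.

```python
import numpy as np

# Toy illustration of algorithmic stability: SGM is run with an identical
# sample path on datasets S and S' differing only in example 0, and we
# measure the final parameter divergence ||w_T - w'_T||. For convex
# L-Lipschitz losses the paper bounds stability by (2 L^2 / n) sum_t alpha_t.
rng = np.random.default_rng(0)
n, d, T, alpha = 100, 5, 500, 0.01
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
Xp, yp = X.copy(), y.copy()
Xp[0], yp[0] = rng.normal(size=d), rng.normal()  # S' differs in example 0

def grad(w, x, yi):
    # gradient of the squared loss 0.5 * (x.w - y)^2 at a single example
    return (x @ w - yi) * x

w = np.zeros(d)
wp = np.zeros(d)
for t in range(T):
    i = rng.integers(n)  # same random index for both runs (coupled paths)
    w = w - alpha * grad(w, X[i], y[i])
    wp = wp - alpha * grad(wp, Xp[i], yp[i])

divergence = float(np.linalg.norm(w - wp))
print(divergence)  # stays small: the runs only separate when index 0 is drawn
```

The two trajectories separate only on the (roughly T/n) steps that sample the differing example, which is the mechanism behind the 1/n factor in the bound.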
Clarity - Justification: The paper is very easy to read.
Significance - Justification: The paper studies an important problem and makes some progress. The results are not groundbreaking and some of the ideas are not new, but the results are nice.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): My main concern with the paper is that it oversells the results a bit. Apart from the result in Prop 5.4, the paper makes no formal claims about excess risk, yet some of the exposition offers suggestions that are based on reasoning about excess risk. For example, the recommendation on line 88 about focusing "on minimizing model training time" is not factually incorrect, but it is not so useful, since it is not obvious that any chosen model could possibly reach the desired error level. I don't find this to be a serious issue; a simple fix might be to remind the reader relatively early in the paper that generalization error is not the same as excess risk. This would frame the results properly for the reader.
Experiments:
1. I don't see any results on the MNIST dataset in the main text. They appear in the appendix, but I'm not sure of the point, since the results are broadly similar to the CIFAR10 results. Maybe they can be removed?
2. I do not see "the linear relationship between generalization error and step size" in Figure 1. At 20 epochs, between 0.0008 and 0.0032 the generalization error roughly doubles, but after that it most certainly does not. Am I missing something? Is this caused by the step size becoming too large?
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): In this paper, the authors present a stability analysis for stochastic gradient descent. The stability bound leads to a generalization bound that cannot be established by classical analyses. In particular, the classical result bounds the excess error of SGD only when the algorithm takes a single pass over the dataset, whereas the analysis in this paper allows the algorithm to take multiple passes.
Clarity - Justification: The paper is well written and easy to follow.
Significance - Justification: The results in this paper are interesting and fill a gap in the study of SGD. There are some limitations, though, as specified in the detailed comments.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
(1) In my view, the main contribution of this paper is Proposition 5.4, which establishes an excess risk bound for SGD even if the algorithm takes multiple passes over the dataset. Suppose that SGD takes k passes over the dataset. Proposition 5.4 holds if the stepsize is chosen on the order of 1/sqrt(kT) (this is specified in the proof). This is a little odd, as the ordinary SGD stepsize should scale as 1/sqrt(T), independent of k. The smaller stepsize required by Proposition 5.4 is an artifact of the stability analysis. My concern is that this requirement may weaken the significance of the result. If we execute SGD for multiple passes, the intent is to get a smaller training error; however, with a smaller stepsize, it is not clear that this goal can be achieved. In fact, in terms of minimizing the training error, it may be no better than taking a single pass.
(2) Limitation (1) may be due to the fact that the stability analysis is not sufficiently tight. It would be interesting to prove Proposition 5.4 with the standard 1/sqrt(T) stepsize, but this may require a more careful stability analysis.
(3) The classical bound holds for general convex functions, while Proposition 5.4 holds only for smooth functions.
(4) The analysis for non-convex optimization has similar issues with the stepsize. The idea of the proof can be summarized as follows: for any T steps of SGD, if we choose a small enough stepsize that scales with T, then the algorithm will be stable, because two SGM runs on datasets differing in a single example will not be able to move sufficiently far apart with such a stepsize. Unfortunately, the stepsize used in this paper may not be the standard one that people use in practice. This is the main point on which I think the paper should be improved.
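The stepsize concern in point (1) can be probed numerically. The sketch below (a toy least-squares setup chosen for illustration; the problem, constants, and the helper names `sgm` and `train_error` are all assumptions, not from the paper) runs k passes of SGM with the standard 1/sqrt(T) stepsize and with the smaller 1/sqrt(kT) stepsize required by the proof of Proposition 5.4, and prints the resulting training errors.

```python
import numpy as np

# Compare training error after k passes of SGM under two stepsize scalings:
# the standard alpha ~ 1/sqrt(T) and the smaller alpha ~ 1/sqrt(kT).
rng = np.random.default_rng(1)
n, d, k = 50, 5, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def train_error(w):
    # empirical risk under the squared loss
    return float(np.mean((X @ w - y) ** 2) / 2)

def sgm(alpha, passes):
    # SGM with random reshuffling and a constant stepsize
    w = np.zeros(d)
    for _ in range(passes):
        for i in rng.permutation(n):
            w = w - alpha * (X[i] @ w - y[i]) * X[i]
    return w

T = k * n  # total number of SGM steps over k passes
err_standard = train_error(sgm(1.0 / np.sqrt(T), k))   # alpha ~ 1/sqrt(T)
err_small = train_error(sgm(1.0 / np.sqrt(k * T), k))  # alpha ~ 1/sqrt(kT)
print(err_standard, err_small)
```

On an easy problem both stepsizes may converge; the reviewer's worry is about regimes where the extra 1/sqrt(k) factor visibly slows optimization, which can be explored by varying k and the problem difficulty.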
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This work presents some interesting results related to stability and stochastic gradient descent (SGD / SGM). The core results, before considering the specializations related to various tricks from the neural network community, appear to be correct and provide interesting insights, giving the paper quite a bit of merit. The results fall into two sets. First, taking the perspective of algorithmic stability to analyze stochastic gradient descent is (to my knowledge) a fresh approach, and the concepts developed and used ($\sigma$-bounded, $\eta$-expansive, the growth recursion) could see re-use in other online stability analyses. Second, the results on the stability of SGD itself, which easily imply in-expectation generalization bounds for SGD, appear to be new; the results for non-convex optimization in particular are the most exciting ones here. On a lower note, I find the extensions of the results to gradient clipping and dropout quite sketchy, and it is unclear whether these results actually hold without further technical support from the authors. In particular, I think there are serious issues with the proof of Theorem 3.8 in the case of an algorithm that is still random even after conditioning on the randomness from the example selection, as is the case for dropout.
Clarity - Justification: Overall the paper is very clear. However, clarity suffers greatly when it comes to the extensions (dropout, etc.) and how the theoretical results support these extensions. Also, it is incredibly unclear how the authors interpreted the relationship in the top plot of Figure 1 to be linear (see 'Detailed comments').
Significance - Justification: The machinery developed to support the authors' stability analysis of SGD is already of some significance. The idea of analyzing SGD via a stability analysis appears to be new, and in light of the results for the generalization properties of SGD for non-convex optimization, the algorithmic stability perspective also appears to offer something new. This makes for a very important work, and it will be interesting to see a proper study of dropout in terms of stability in future work.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): In Theorem 3.8, it seems obvious that one should tune $c$ rather than treating it as a constant, as it is a free parameter.

Gradient clipping: It is unclear whether gradient clipping preserves the first claim of Lemma 3.6 (I did not consider the other two claims, as I assume we are in the non-convex setting), i.e. that $G_{f,\alpha}$ is $(1 + \alpha \beta)$-expansive. The claim may still hold with a different amount of expansiveness, but this does not seem trivial. The authors should be clear on what they mean by truncation, scaling, and dropping of examples. For instance, I interpret dropping to mean that we now have $\|G(v) - G(w)\| = \|G(v) - w\|$ if the gradient that would have been used to update from $w$ to $G(w)$ is large.

Dropout: There are numerous issues here. First, is the assumption of a constant dropout rate $s$ in Definition 4.3, invariant of $v$, at all realistic to how dropout is usually performed? Even if we accept this for now, it is still unclear whether the first claim of Lemma 3.6 (expansiveness for smooth functions) goes through in the case of dropout. Also, the proof of Theorem 3.8 as stated is too imprecise to unambiguously handle the case of dropout. Do you couple the dropouts affecting $w_T$ and $w'_T$? The proof of Theorem 3.8 (Theorem 3.12 in the appendix) is based on Lemma 3.11 in the appendix. This lemma upper bounds $\mathbb{E} [ | f(w_T, z) - f(w'_T, z) | ]$, where $\{w_t\}_t$ and $\{w'_t\}_t$ are the sequences obtained from applying SGM on $S$ and $S'$ respectively. However, even when each SGM instance is presented with the same example, the updates are still random due to the algorithm's internal randomness from dropout. How is this randomness handled? Are the instances of SGM coupled so that the random bits used for determining the dropout are identical, or are they instead independent? I suspect the latter, but the authors need to be explicit about the choice and argue why it is the correct one. Note that this problem crops up because of the absolute value inside the expectation, without which the issues in the case of no coupling disappear.

The main message of Proposition 5.4 appears to be that, theoretically, there is no benefit to running for more than $n$ iterations (i.e. selecting $T > n$). Is the authors' intent instead just to point out that one can still maintain guarantees similar to those for $T = n$ by running for longer, so that, if there are empirical benefits to running longer, we can get those while still maintaining the same theoretical guarantees as for $T = n$?

Line 700: "The linear relationship" - The relationship between generalization and step size is definitely not linear. Were it linear, for any (or at least quite a few!) vertical slices of the top plot of Figure 1, the gaps between the fit lines should be exponentially increasing, whereas they clearly are not; the relationship is instead sublinear.

Minor comments:
- You should mention that Definition 2.1 is formally equivalent to Shalev-Shwartz et al.'s definition of strongly-uniform-RO stability (Definition 22 of that paper).
- On line 618, $J$ appears but was never used before.
- On lines 857-858, I think dropout would increase the parameter distance rather than decrease it if the internal randomness of the different runs of the algorithm is not coupled.
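For reference, the unclipped version of the first claim of Lemma 3.6 discussed above is easy to check numerically: for a $\beta$-smooth $f$, the gradient update $G_{f,\alpha}(w) = w - \alpha \nabla f(w)$ is $(1 + \alpha\beta)$-expansive. The sketch below uses the toy function $f(w) = \sum_i \cos(w_i)$ (smooth with $\beta = 1$, non-convex), chosen purely for illustration.

```python
import numpy as np

# Sanity check: for f(w) = sum(cos(w)), which is 1-smooth, the gradient
# update G(w) = w - alpha * grad f(w) should be (1 + alpha)-expansive,
# i.e. ||G(v) - G(w)|| <= (1 + alpha) * ||v - w|| for all v, w.
rng = np.random.default_rng(2)
alpha, beta = 0.5, 1.0

def G(w):
    # grad f(w) = -sin(w) for f(w) = sum(cos(w))
    return w - alpha * (-np.sin(w))

max_ratio = 0.0
for _ in range(1000):
    v, w = rng.normal(size=4), rng.normal(size=4)
    ratio = np.linalg.norm(G(v) - G(w)) / np.linalg.norm(v - w)
    max_ratio = max(max_ratio, float(ratio))

print(max_ratio)  # bounded by 1 + alpha*beta = 1.5, since sin is 1-Lipschitz
```

Whether the analogous bound survives once the gradient is clipped, scaled, or dropped is exactly the non-trivial question raised above; this check only covers the plain update.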
=====