We thank the reviewers for their valuable comments and helpful suggestions.

** Dropout and gradient clipping: Reviewer_5 is correct that we should be more precise about gradient clipping. A common way to clip gradients is to set the update equal to \nabla f(x) if \|\nabla f(x)\| <= B, and equal to B*\nabla f(x)/\|\nabla f(x)\| if the norm exceeds B. This enforces a maximum gradient norm of B. Note that this mapping is beta-Lipschitz if \nabla f(x) is beta-Lipschitz, which can be seen from the chain rule. (A short code sketch of this clipping map is included at the end of this response.) Reviewer_5 is also concerned about dropout. We first note that we assume the dropout rate is constant, but that the zeroed-out coordinates change from iteration to iteration. We also assume that the randomness in the dropout is coupled, i.e., that w and w' see the same coordinates dropped out in each iteration. This mirrors our assumption that w and w' see the same random ordering of the data, and it is also how we implement dropout in our experiments (see the coupled-dropout sketch at the end of this response). Independently, we noticed a typo in our definition of dropout: ``D\nabla f(w;z)'' should be ``D\nabla f(Dw;z).'' With this corrected definition, the constant L does not increase and the smoothness \beta improves by a factor of p.

** Theorem 5.4: Assigned_Reviewers 5 and 7 asked for clarifications regarding Theorem 5.4. We first note that we propose the stepsize alpha = (D/L)*(1+2T/n)^{-1/2}. If T is a constant multiple of n, this is only a constant factor smaller than the normal stepsize (at most 1.8 times smaller; see the short numerical check at the end of this response). We do not suggest using a stepsize of 1/sqrt(kT). Moreover, we would like to emphasize that the bounds we derive are not directly comparable to classical bounds derived for the case T = n. As we describe in the manuscript, standard techniques can only bound the excess risk of SGD if each data point is touched exactly once. In that case, the excess risk satisfies

R[w] <= R_pop + DL/sqrt{n},

where R_pop is the minimum value of the population risk. With our analysis, the excess risk after E epochs using our stepsize satisfies

R[w] <= R_erm + (DL/sqrt{n})*sqrt(E+2),

where R_erm is the expected minimum empirical risk. Since R_erm <= R_pop, these bounds are not directly comparable.

** Experiments: Assigned_reviewer_5 and Assigned_reviewer_6 both point out that our rates appear sublinear in alpha T/n. This is stronger than what we conjecture, but it is indeed what we observe in the experiments. It would be quite interesting to investigate this sublinear scaling in theory.
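
For concreteness, here is a minimal sketch of the clipping map described above (NumPy-based and purely illustrative; the function name is ours and not notation from the paper):

    import numpy as np

    def clip_gradient(g, B):
        # Return g unchanged if ||g|| <= B; otherwise rescale g so that its
        # norm is exactly B.  This is the clipping map discussed in our
        # reply to Reviewer_5.
        norm = np.linalg.norm(g)
        return g if norm <= B else (B / norm) * g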
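
To make the coupling explicit, here is a minimal sketch of one coupled SGD step with dropout (again NumPy-based and illustrative; grad_f, alpha, and the convention that a coordinate is kept with probability 1-p are placeholders, not our exact experimental code):

    import numpy as np

    def coupled_dropout_step(w, w_prime, z, grad_f, alpha, p, rng):
        # One dropout mask D is drawn per iteration: the dropout rate p is
        # constant, but the zeroed-out coordinates change from iteration to
        # iteration.  Crucially, the SAME mask is applied to both iterates
        # w and w', which is the coupling assumption described above.
        D = (rng.random(w.shape) >= p).astype(w.dtype)
        # Corrected dropout update from the response: D * grad_f(D w; z).
        w_new = w - alpha * D * grad_f(D * w, z)
        w_prime_new = w_prime - alpha * D * grad_f(D * w_prime, z)
        return w_new, w_prime_new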
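
Finally, a short numerical check of the stepsize constant mentioned under Theorem 5.4 (a sketch only; the function name is ours, and we take T = n as the illustrative case):

    import math

    def proposed_stepsize(D, L, T, n):
        # alpha = (D/L) * (1 + 2T/n)^{-1/2}, the stepsize proposed above.
        return (D / L) / math.sqrt(1.0 + 2.0 * T / n)

    # With T = n, the stepsize shrinks by a factor of sqrt(1 + 2) = sqrt(3) ~= 1.73,
    # i.e. by less than the factor of 1.8 quoted above.
    assert abs(proposed_stepsize(1.0, 1.0, 100, 100) - 1.0 / math.sqrt(3.0)) < 1e-12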