We thank the reviewers for their comments. We are happy to address the concerns of the reviewers and to fix all the typos found.$
(1) Regarding “direct analysis of non-strongly convex functions”: We obtain the IFO complexity of SVRG for the non-strongly convex case through a “direct analysis”. By direct, we mean an analysis that does not add a strongly convex perturbation to the objective. For example, please refer to Cor. 2 in (Xiao & Zhang, 2014), where a strongly convex perturbation is added, so their analysis is “indirect”. In practice, one generally runs SVRG without adding such regularization, and it works well.
While a theoretically better rate can be obtained for non-strongly convex problems by the indirect approach, it is often hard to make it work well in practice. It is also not clear whether a similar strategy is useful for non-convex problems, i.e., whether simply adding a small strongly convex perturbation can help us obtain better rates for non-convex optimization.
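To make the distinction concrete, a minimal sketch of SVRG run “directly” on the objective is shown below. This is illustrative only, not the paper's exact pseudocode: the function name `svrg` and its parameters are our own, and no strongly convex perturbation is added to the objective.

```python
import numpy as np

def svrg(grad_i, x0, n, step, epochs, m, rng):
    """Minimal SVRG sketch (illustrative; not the paper's exact pseudocode).
    grad_i(x, i) returns the gradient of the i-th component function f_i at x.
    No strongly convex perturbation is added: the objective is optimized directly."""
    x_snap = x0.copy()
    for _ in range(epochs):
        # Full (average) gradient at the snapshot: n IFO calls per epoch.
        g_snap = sum(grad_i(x_snap, i) for i in range(n)) / n
        x = x_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient: O(1) IFO calls.
            v = grad_i(x, i) - grad_i(x_snap, i) + g_snap
            x = x - step * v
        x_snap = x  # use the last iterate as the next snapshot
    return x_snap
```

As a usage example, one could run this on a least-squares instance f_i(x) = (a_i^T x - b_i)^2 / 2; note the sketch returns the last iterate, in line with the practical observation discussed in our response to Reviewer 3.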
*REV_1*
We will be happy to provide examples of the constants for Thm. 1 in the final version. The global linear rate for gradient dominated functions is mainly of theoretical interest. The growth condition in Eqn. (2) is often restrictive. An interesting scenario is the one mentioned in Cor. 6 (see Shalev-Shwartz (2015)). This scenario can arise when iteratively solving non-convex problems, where each iteration solves a convexified subproblem obtained by adding a reasonably large strongly convex perturbation. Nevertheless, we believe this is a first step towards exploring function subclasses for which non-convex stochastic methods can achieve convergence to global optima.
*REV_2*
Regarding "it is not clear if ...": Please refer to (1) at the top.
When m = 1, SVRG reduces to gradient descent. However, note that Table 1 lists the “best” IFO complexity achievable by each of the methods. For SVRG, this is obtained when alpha = ⅔ in Thm 3 (as noted in Cor 3), in which case m should be of the order n^{3*alpha/2} (which is n, since alpha = ⅔). We will edit the paper to make these points clearer.
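The arithmetic behind this choice of alpha can be written out as follows (a worked restatement in the paper's notation):

```latex
% With m = \Theta(n^{3\alpha/2}), setting \alpha = 2/3 gives
m = \Theta\bigl(n^{3 \cdot (2/3)/2}\bigr) = \Theta(n),
% so each epoch performs \Theta(n) inner iterations, matching the n IFO
% calls spent computing the full gradient at the start of the epoch.
```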
*REV_3*
Dependence on L in the bounds: For ease of exposition, we assumed L to be a constant throughout the paper (please refer to Lines 220-225). Due to this assumption, L is hidden in the constants used in the paper. We agree that it would be better to make the dependence on L explicit in our bounds. The key change is that L will appear in the numerator of our bounds. For example, the upper bound in Thm 3 becomes (L n^{alpha} [f(x^0) - f(x^*)])/(T nu). There are no hidden constants other than L. We will also address the reviewer's other comments about the Lipschitz constant.
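In display form, the modified bound of Thm 3 with the dependence on L made explicit would read as follows (writing x_a for the output iterate; same symbols as in the paper):

```latex
\mathbb{E}\bigl[\|\nabla f(x_a)\|^2\bigr]
  \;\le\; \frac{L\, n^{\alpha}\, \bigl[f(x^0) - f(x^*)\bigr]}{T\,\nu}.
```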
Gradient dominated functions were first studied by Polyak in 1963 for gradient descent (GD). As mentioned in Lines 460-464 (and in Table 1), and as the reviewer points out, GD achieves a linear rate for this class. However, the IFO complexity of GD is higher than that of SVRG (please refer to Table 1). Our main contribution for gradient dominated functions is to exhibit a “non-convex stochastic method” with a linear rate and low IFO complexity.
We agree that the assumption of sigma-bounded function gradients is strong, and as pointed out in the paper (Lines 653-659), it is often required for the analysis of SGD. However, note that non-convex SVRG does not require this assumption and is hence much more favorable (Lines 653-661).
The theoretical analysis requires the output to be picked randomly from all the iterates of the algorithm. However, in our experiments we observed that simply picking the last iterate works well.
We apologize for any confusion caused in Thm 3. There is no circular dependency. There exist universal constants mu_0 and nu for which the theorem holds. We will explicitly show the constants in the final version.
“Direct analysis for non-strongly convex”: Please refer to (1) at the top.
Regarding Cor. 2: The proof (in the appendix) follows from the following counting argument. When alpha < ⅔, the cost of computing the average gradient at the start of an epoch (n IFO calls) dominates the total cost of the m inner iterations of that epoch, since m = O(n^{3*alpha/2}) and each inner iteration costs O(1) IFO calls. (By inner iteration, we mean one iteration within an epoch.) Hence, the “effective” cost of one inner iteration is n^{1 - (3*alpha/2)}. Also note that, overall, we require O(n^{alpha}/epsilon) inner iterations (see Thm 3). Combining these facts yields the result for alpha < ⅔. For alpha >= ⅔, the total cost of the m inner iterations dominates the cost of computing the average gradient, so the effective cost of an inner iteration is O(1). Again using Thm 3, the result follows. We will make this point clearer.
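The two cases can be sketched explicitly (a worked version of the counting argument above, in the paper's notation):

```latex
% Each epoch costs n (average gradient) + O(m) (inner iterations) IFO calls,
% with m = \Theta(n^{3\alpha/2}) and O(n^{\alpha}/\epsilon) total inner iterations.
% Case \alpha < 2/3: the n term dominates, so the effective cost per inner
% iteration is n/m = n^{1 - 3\alpha/2}, giving
\mathrm{IFO} = O\!\left(n^{1 - 3\alpha/2} \cdot \frac{n^{\alpha}}{\epsilon}\right)
             = O\!\left(\frac{n^{1 - \alpha/2}}{\epsilon}\right).
% Case \alpha \ge 2/3: the O(m) term dominates, the effective cost per inner
% iteration is O(1), and
\mathrm{IFO} = O\!\left(\frac{n^{\alpha}}{\epsilon}\right).
% The two cases agree at \alpha = 2/3, where both give O(n^{2/3}/\epsilon).
```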
About Thm 7: We believe there is a misunderstanding regarding the notation. In b = o(n^{2/3}), we are using the little-o notation from complexity analysis. By this we essentially mean that b grows strictly slower than n^{2/3}, not that b is at least n^{2/3}. So the theorem does hold for small minibatches.