We thank all the reviewers for their efforts in reviewing our paper! We will take all the minor suggestions into account. Below, we address only the major concerns.

Q by R5: "I'd like to see experiments with parameters tuned using prior info or cross validation." A: First, we fully agree that these suggestions (cross-validation, etc.) can improve the experiments, and we promise to add them to the paper. Adding such experiments will not be hard for us, because we have written a code base covering SVRG, SAGA, SAG, SVRG++, SDCA, and SGD; we promise to release the code once the paper is published. Regarding the existing experiments, all comparisons were truthful and fair: there is in fact only one tuning parameter (the step length eta) in SAGA, SVRG, and SVRG++. Although the tuning was done in hindsight, it was performed automatically and identically for all methods over the same discretized set {10^k, 3*10^k : k an integer}. We believe this is a fair comparison demonstrating that SVRG++ outperforms the baselines. [[ In more detail: SVRG has three parameters (m, T, eta) and SVRG++ has three parameters (m0, S, eta). (1) The SVRG paper selects m = 2n without tuning, and we select m0 = n/4 without tuning. (2) T, the number of iterations, need not be tuned for SVRG; similarly, our S need not be tuned. (3) eta, the step length, requires tuning for both SVRG and SVRG++. ]]

Q by R5: "real examples for non-convex fi?" A: Interesting! We can include PCA on real-life datasets in our next version. We used synthetic data in the submitted version only because we wanted to control the upper/lower smoothness parameters in order to plot Figure 2.

Q by R8: "where does 3l/2 come from" and "how to go to \sigma/10." A1: The 3l/2 term on the RHS of line 1512 is correct. At the very front of the RHS there is another factor 4*eta*L/(1 - eta*L); multiplied together, the two factors give 6*eta*L*l/(1 - eta*L), which is exactly what R8 suggested. A2: To obtain sigma/10 on line 1518, it suffices to prove 6*eta*L*l/(1 - eta*L) <= sigma/10. This relies on the choice eta = min{1/(21L), sigma/(63Ll)}, which gives 6*eta*L*l <= 6*sigma/63 and 1 - eta*L >= 1 - 1/21, hence 6*eta*L*l/(1 - eta*L) <= (6*sigma/63)/(1 - 1/21) = sigma/10; the chain of inequalities is spelled out below.
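For R8's convenience, here is the same calculation written out explicitly (no new argument, just the arithmetic behind the sigma/10 bound, in the notation of the paper):

\[
\eta = \min\Big\{\frac{1}{21L},\ \frac{\sigma}{63Ll}\Big\}
\;\Longrightarrow\;
6\eta L l \le \frac{6\sigma}{63}
\quad\text{and}\quad
1-\eta L \ge 1-\frac{1}{21} = \frac{20}{21},
\]
\[
\text{so}\qquad
\frac{6\eta L l}{1-\eta L}
\;\le\; \frac{6\sigma/63}{20/21}
\;=\; \frac{6\sigma}{63}\cdot\frac{21}{20}
\;=\; \frac{\sigma}{10}.
\]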
Q by R8: "it seems once iterations hits L/eps the convergence should be exponential, which is not shown." A: For non-strongly convex functions, the theory predicts an L/eps convergence rather than exponential convergence for large T. It may be easier to see this via the following plot: go to Wolfram Alpha and enter "Plot[InverseFunction[1000*Log[1/#1] + 50/#1 &][x], {x, 0, 10000}]"; this is the theoretical convergence curve. (A short numerical sketch reproducing this plot is appended at the end of this response.)

Q by R8: "what is a full pass over the data?" A: A full pass is the computation of n stochastic gradients \nabla f_i(x). We also count the snapshot full-gradient computations of SVRG and SVRG++ as full passes.

Q by R7: "can lambda be 1/n?" A: Thanks for catching a typo! lambda can be as small as 1/d, so our result can improve the complexity by a factor of 1/d, which is still significant. Following the scaling of Garber-Hazan, the matrix A = (1/n) sum_i a_i a_i^T has trace at most 1. This implies lambda >= 1/d; thus lambda is lower bounded by 1/d, but not by a constant.

Q by R7: "what's the connection to [1]?" A: Interesting! Our work comes from the ERM community, where SVRG and SAGA are the standard algorithms, so we were not aware of this oracle-model result. After careful reading, we believe our results are stronger in several ways. First, [1] only supports bounded domains, while we support proximal terms. Second, their result uses the average point as the starting vector, which is less natural and less efficient; our theory works with the last point, which gives better performance in practice and was predicted by [Johnson-Zhang]. Third, our result supports non-strongly-convex and sum-of-non-convex objectives together, which does not fit into their framework. Fourth, their result does not apply to the interesting Lasso and logistic regression problems; this is perhaps one of the reasons their paper includes no experiments.

Q by R7: "what's the connection to Lemma 11 in [2]?" A: After careful checking, [2] provides a result similar to our Lemma C.1, but only in the special case of PCA and only for strongly convex functions. [2] is not a machine learning paper, so even Hazan himself did not know of [2] until a few weeks ago (private correspondence), and neither did we. Hazan was informed about our result (which improves Garber-Hazan's PCA by a factor of lambda) back in late 2015. Thus, strictly speaking, [2] should be regarded as concurrent work. Furthermore, our result is stronger in several ways. First, we introduce lower/upper smoothness for the first time for the class of problems we focus on. Second, our Section 6 + Theorem 6.1 (which use Lemma C.1) solve a much larger class of problems compared with the PCA setting of [2]. Third, our Section 5 + Theorem 5.1 and Section D + Theorem D.1 (for other sum-of-non-convex problems) have nothing to do with Lemma C.1 or PCA, so they are not related to [2]. (For instance, Theorem 5.1 can be applied to Fiedler vector computation, which is not PCA.)
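As promised above, here is a minimal Python sketch (assuming numpy, scipy, and matplotlib are available) that reproduces the theoretical convergence curve from the answer to R8 by numerically inverting T(eps) = 1000*log(1/eps) + 50/eps. The constants 1000 and 50 are the same illustrative placeholders used in the Wolfram Alpha command, not the constants of our theorems.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import brentq

# Illustrative complexity bound T(eps) = n*log(1/eps) + L/eps with the
# placeholder constants n = 1000 and L = 50 from the Wolfram Alpha command.
def iterations_needed(eps):
    return 1000.0 * np.log(1.0 / eps) + 50.0 / eps

# Invert the bound numerically: for a budget of t iterations, find eps(t).
# The bound is monotone decreasing in eps, so the root is unique.
def error_after(t):
    return brentq(lambda e: iterations_needed(e) - t, 1e-12, 1.0 - 1e-12)

budgets = np.linspace(100, 10000, 200)
errors = [error_after(t) for t in budgets]

plt.semilogy(budgets, errors)
plt.xlabel("number of iterations T")
plt.ylabel("error eps(T)")
plt.title("Theoretical convergence for the non-strongly convex case")
plt.show()

The L/eps term dominates for large T, so the curve flattens instead of dropping at a linear (exponential) rate, which is exactly the behavior R8 asked about.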