We thank the reviewers for their input and would like to address their main comments.
#Reviewer 1
Generality of the method and adaptation to a strategy like the one proposed by Hu et al., 2009. -- We agree with the reviewer on the goal of developing a more general theory that applies to a wider range of methods, not just SAGA. In fact, we are currently working on an analysis of other gradient-based optimization methods whose complexity depends on the number of training datapoints. It turns out, however, that such extensions are non-trivial and require additional ideas and techniques. The mentioned reference of Hu et al., 2009 is a good one in this context and we will certainly pursue it. However, we feel these extensions are beyond the scope of the current paper.
Connections with "curriculum learning". -- Thank you for pointing this out. The adaptive sample size strategy we developed can be broadly placed in the "curriculum learning" area. The main difference between that area and our work is that our strategy is algorithmically driven rather than data-driven. In other words, the convergence of the optimization procedure determines our sample size strategy. Curriculum learning, in contrast, selects "easy examples" for optimization first and adds difficult samples after training on the initial ones. We will include this reference in the revised version.
#Reviewer 2
Typo in Lemma 1, line 142. -- We will correct this mistake.
“O(1/n) generalization bounds are already known in the literature in the non-realizable setting for strongly convex objectives”. -- Reviewer 2 is correct and we will make this more precise in the revised version.
Important references to be mentioned in the paper. -- We will add all the references pointed out by reviewer 2. The work of Wang et al. is indeed quite relevant, although our approach differs in a number of respects. Their work, for example, does not suggest any new sampling strategy and hypothesizes that SGD is optimal in terms of generalization error as long as lambda = 1/sqrt(m) (where lambda is the regularization parameter). Note that our analysis actually shows that dynaSAGA enjoys a better generalization error than SGD in this setting.
#Reviewer 3
"some notations are not explicitly define such as E_a and C_S line 142" -- E_a represents the expectation over the algorithmic randomness of the optimization method. C_S denotes the initial error that is defined in equation 12 of the appendix. We will clarify these notations in the final version.
“it is not clear which version of dynaSAGA is used in the experimentation of the main paper." -- We use the ALTERNATING strategy, as mentioned in section 4.4.
The typos will be corrected, and the few missing notations will be explicitly defined or clarified. We will also add all the references pointed out by the reviewers.