Paper ID: 705
Title: Starting Small - Learning with Adaptive Sample Sizes

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper investigates strategies to dynamically increase the sample size used by a stochastic optimization algorithm. The proposed approaches are motivated by the trade-off between the statistical error and the optimization suboptimality. In particular, the authors propose to modify the SAGA algorithm in order to incorporate a dynamic sample size schedule. Two schedules are proposed, LINEAR and ALTERNATIVE, in which the sample size is gradually increased as training proceeds. A theoretical analysis of the LINEAR schedule is provided, showing competitive convergence rates. An empirical evaluation of both LINEAR and ALTERNATIVE is carried out on both synthetic and real datasets. The evaluation verifies that, by concentrating the computational budget on fewer data points, a convergence speed-up can be observed.

Clarity - Justification:
Overall, the paper is clear and well written.

Significance - Justification:
To my knowledge, this is the first paper to explore a dynamically growing sample size during training. The results are promising and would probably be of interest to the community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1) Some notations are not explicitly defined, such as E_a and C_S on line 142.
2) While a comparison between LINEAR and ADAPTIVE is provided in the appendix, it is not clear which version of dynaSAGA is used in the experiments of the main paper.
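For concreteness, the following is how I read the LINEAR schedule, written as a rough numpy sketch for ridge regression. This is my own reconstruction, not the authors' algorithm: the objective, step size, m0, the admission order of samples, and the initialisation of newly admitted gradient-table entries are all my guesses.

import numpy as np

def dyna_saga_sketch(X, y, lam=0.1, step=0.01, n_iters=None, m0=10, seed=0):
    """SAGA with a linearly growing active sample set (my reconstruction)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_iters = n_iters if n_iters is not None else 2 * n   # "two passes" budget
    w = np.zeros(d)
    table = np.zeros((n, d))   # stored per-sample gradients (the SAGA table)
    table_sum = np.zeros(d)    # running sum of stored gradients over the active set

    def grad(i, w):            # gradient of one ridge-regression term
        return (X[i] @ w - y[i]) * X[i] + lam * w

    m = 0                      # current number of active samples
    for t in range(n_iters):
        # LINEAR schedule: grow the active set from m0 to n over the iteration budget
        target = min(n, m0 + (n - m0) * t // max(1, n_iters - 1))
        while m < target:      # admit new samples (data assumed pre-shuffled)
            table[m] = grad(m, w)
            table_sum += table[m]
            m += 1
        i = rng.integers(m)    # pick a sample uniformly from the active set
        g = grad(i, w)
        w -= step * (g - table[i] + table_sum / m)   # SAGA-style variance-reduced step
        table_sum += g - table[i]
        table[i] = g
    return w

If this reading of the schedule is wrong, a short pseudo-code listing in the main paper would help.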
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper studies smooth and strongly convex empirical risk minimization (finite sums) by analyzing the optimization and generalization errors jointly, and aims to exploit this to reduce the number of passes over the data needed to match both errors. In particular, it proposes a novel dynamic sample-set selection scheme for variance-reduced stochastic optimization methods that requires only $2n$ iterations (i.e., two passes over the data) instead of the naive $\log n$ passes to match both errors.

Clarity - Justification:
As far as I checked, the paper seems technically sound. Regarding the quality of the writing, the paper is reasonably well written, the structure and language are good, and overall the paper is an easy and enjoyable read (except for the minor comments listed in the detailed comments).

Significance - Justification:
I think this paper asks an interesting question and provides a novel theoretical analysis of the combined optimization and generalization accuracies in large-scale learning regimes, followed by an efficient adaptive algorithm. In general, the results of this paper shed light on applying stochastic optimization algorithms to solve large-scale learning problems by balancing computational and statistical guarantees.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

--- scope of the paper ---
This paper is motivated by the fact (Bousquet and Bottou, NIPS'08) that in many large-scale learning problems the optimization accuracy of empirical risk minimization over a finite sample is dominated by the generalization error; therefore, any optimization accuracy that matches the statistical error is sufficient for learning purposes. As a result, the trade-off between optimization accuracy and true generalization error, when combined and analyzed together, can be exploited for optimization advantages. This can potentially result in an effective way of dynamically controlling the sample size so that the optimization accuracy matches the statistical accuracy, which is the main focus of this paper. The paper studies the problem in the context of one of the recently developed stochastic optimization methods with linear convergence for finite sums (i.e., SAGA by Defazio et al., 2014). In addition to its nice theoretical analysis, the paper introduces a novel algorithm which adaptively selects the samples and is able to reduce the number of passes over the data from $\log n$ to only TWO passes to bring the optimization error down to the statistical accuracy.

--- contributions of the paper ---
For large-scale finite-sum optimization, when the number of samples is larger than the condition number, $n \geq \kappa$, the linear convergence rate of $(1-\frac{1}{n})^t$ after $t$ iterations, which is achievable using variance-reduced stochastic optimization methods, results in a constant when the number of iterations is set to $n$, i.e., $t=n$, also called the one-pass optimization setting. This is far from the generalization error of $1/n$. To match the generalization bound, one needs to make roughly $\log n$ passes over the data, i.e., $n \log n$ iterations, to bring the optimization accuracy down to the level of the statistical accuracy. In contrast to these naive multiple passes over the sample, this paper proposes an adaptive scheme with a dynamic sample size, which provides optimization guarantees that match the generalization error after only $2n$ iterations. Therefore, with only two passes over the data (basically $2n$ iterations), the optimization error is forced to match the generalization error, which is a significant improvement compared to the naive $\log n$ passes over the data. The authors also complement the theoretical results with an empirical study on real and synthetic data sets.
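To make the pass-count comparison concrete, my back-of-the-envelope reading is the following (a rough sketch that ignores constants and assumes the standard linear rate in the regime $n \geq \kappa$):
\begin{align*}
\text{suboptimality after $t$ steps:}\quad & \Big(1 - \tfrac{1}{n}\Big)^{t} \approx e^{-t/n},\\
\text{matching the statistical accuracy $\tfrac{1}{n}$:}\quad & e^{-t/n} \leq \tfrac{1}{n} \;\Longleftrightarrow\; t \geq n \log n \quad \text{(about $\log n$ passes)},\\
\text{the paper's claim:}\quad & t = 2n \text{ iterations (two passes) suffice with the dynamic schedule.}
\end{align*}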
--- comments ---
1- In Lemma 1, line 142, in the definition of $\rho_n$, it should be $\frac{\mu}{L}$, i.e., the inverse of the condition number $\kappa$.

2- The authors mostly focused on the O(1/n) generalization bound and claimed it holds for the realizable case. But since the paper focuses on smooth and strongly convex finite sums, O(1/n) generalization bounds are already known in the literature in the non-realizable setting for strongly convex objectives (Kakade and Tewari, NIPS'08, for the online-to-batch conversion for strongly convex objectives, and Sridharan, Srebro, and Shalev-Shwartz, NIPS'08, the so-called fast rates), and it would be better if the authors made this precise in the paper.

3- There are a few important references that I believe deserve to be mentioned in the paper. The following paper also studies the same problem but has weaker results:

@article{wang2016reducing,
  title={Reducing Runtime by Recycling Samples},
  author={Wang, Jialei and Wang, Hai and Srebro, Nathan},
  journal={arXiv preprint arXiv:1602.02136},
  year={2016}
}

There are also other variants of stochastic optimization methods with variance reduction that were discovered in parallel to SVRG, which is mentioned in the paper:

@inproceedings{mahdavi2013mixed,
  title={Mixed optimization for smooth functions},
  author={Mahdavi, Mehrdad and Zhang, Lijun and Jin, Rong},
  booktitle={Advances in Neural Information Processing Systems},
  pages={674--682},
  year={2013}
}

@inproceedings{zhang2013linear,
  title={Linear convergence with condition number independent access of full gradients},
  author={Zhang, Lijun and Mahdavi, Mehrdad and Jin, Rong},
  booktitle={Advances in Neural Information Processing Systems},
  pages={980--988},
  year={2013}
}

@article{konevcny2013semi,
  title={Semi-stochastic gradient descent methods},
  author={Kone{\v{c}}n{\`y}, Jakub and Richt{\'a}rik, Peter},
  journal={arXiv preprint arXiv:1312.1666},
  year={2013}
}

--- overall evaluation ---
Overall, I think this paper provides strong theoretical results that fill a gap in an important area. The immediate practical implications of this work may not be large, but I think it is fundamental enough that it warrants publication.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper investigates the relevance of learning from data samples of increasing size with stochastic gradient procedures. The starting point of the paper is the trade-off posed by large-scale learning, where statistical accuracy and optimization accuracy have to be balanced. The authors therefore propose an optimization scheme that targets an "ideal" trade-off between those two accuracies. The proposed strategy simply consists in learning from data samples of increasing sizes (the samples considered are thus nested), upon which stochastic gradient descent is used; theoretical and empirical evidence of the relevance of the contributions is provided.
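In my own shorthand (the paper's exact statement around line 303 may differ), the trade-off being targeted is the usual split of the excess risk into an optimization term and a statistical term,
$$
F(\tilde{w}_t) - F(w^\star) \;\lesssim\; \underbrace{\hat{F}_n(\tilde{w}_t) - \hat{F}_n(\hat{w}_n)}_{\text{optimization error}} \;+\; \underbrace{O(1/n)}_{\text{statistical error (fast rate under strong convexity)}},
$$
where $\hat{F}_n$ is the empirical risk over $n$ samples, $\hat{w}_n$ its minimizer, $F$ the expected risk, and $\tilde{w}_t$ the iterate after $t$ steps; the schedule is designed to spend just enough computation to drive the first term down to the order of the second.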
Clarity - Justification:
In my opinion, this paper is extremely clear given its level of technicality. The authors have been able to provide in the main text the minimal set of results/notations/definitions needed for the reader to be convinced of the correctness of their contribution. The more technical results are postponed to the appendix (which I have not read fully). Besides, the structure of the paper is nice, as it is straightforward to spot right away the keys of the contribution.

Significance - Justification:
The contribution of the paper is above average. It might have been more forceful if an additional level of genericity had been provided: here, the algorithmic contribution, DynaSAGA, relies heavily on the SAGA algorithm, and it would have been nice to provide a contribution less tied to the base SAGA algorithm. Connections with "curriculum learning" and other papers that try to focus on easy data first might have been discussed (in the present paper, the easiness is carried by the size of the datasets). Still, I think the contribution is substantial.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I essentially have one question: how would the proposed approach of optimizing on samples of increasing sizes carry over to a strategy like the one proposed by Hu et al., 2009 on accelerated-gradient methods for stochastic optimization?

Typos:
- which a a convergence -> which a convergence
- line 303: the \pm sign is not the best way to decompose the empirical risk. The full decomposition may be provided.
- line 369: {\cal X} -> {\bf X}
- line 416: fulfils -> fulfills

=====