Paper ID: 586
Title: Variance-Reduced and Projection-Free Stochastic Optimization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper develops variance-reduced variants of stochastic projection-free optimization algorithms. The paper presents an almost complete analysis, with some proofs of technical lemmas included in the appendix. The convergence rates are improved with the variance-reduced stochastic gradients.

Clarity - Justification:
The notation can be improved. In particular, I found the notation x_0 in Algorithm 1 and y_0 in Algorithm 2 very confusing in the analysis. These quantities should depend on t; it would be better to add a superscript t to x_0 and y_0.

Significance - Justification:
The convergence rate of stochastic FW algorithms is slow in the current literature. The variance reduction technique has been used to speed up SGD for smooth and strongly convex optimization. This paper presents a nice theory for a variance-reduced gradient version of SFW, and the new convergence rates are improved.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
There are several places in the proof that are confusing and might have a problem.

In Lemma 3, is D_t \leq D needed for proving E[\|y_0 - w_*\|^2] \leq D_t^2? For cases a and b, D_t = D, so this is fine. But for case c, D_t \leq D is not necessarily true, since D_t depends on \mu; for smaller t the inequality D_t \leq D does not hold. However, from the proof I did not see why D_t \leq D is needed in case c for proving E[\|y_0 - w_*\|^2] \leq D_t^2. It would be better to clarify this.

In the proof of Theorem 2 (page 7, line 697), you used the fact that (s+1)^2 \leq (N_t+1)^2. Is s = 2 here? N_t = \sqrt{32\mu} in this case, and I did not see why the fact holds.

Page 7, line 695: "The case for s=2 can be verified similarly using the term bound on E[f(y_0) - f(w_*)]". I think you also need the bound on E[f(y_1) - f(w_*)] to bound f(z_2) - f(w_*), which you have not proved yet.

These places need to be clarified or corrected before publication.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Two stochastic variants of the Frank-Wolfe optimization method are discussed, i.e., methods where the gradient is sampled. The new idea behind both methods is to use a variance-reduced sampling scheme that was invented by Johnson & Zhang and by Mahdavi et al. in 2013. The idea behind the variance reduction scheme is to "shrink" a sampled gradient toward a full gradient that has been computed at some snapshot point. The snapshot point is also updated during the run of the algorithm (in an outer loop).

The first variant studied is the variance-reduced version of the straightforward stochastic version of the Frank-Wolfe algorithm. Variance reduction here reduces the number of sampled gradients that need to be computed from O(1/eps^3) to O(1/eps^2), where eps is the prescribed accuracy of the algorithm. The variance reduction needs O(log(1/eps)) additional full gradient computations.

The second variant studied is a variance-reduced version of a Frank-Wolfe variant for strongly convex and smooth objective functions. Here, depending on the additional requirements, the number of sampled gradients that need to be computed can be reduced from O(1/eps) to O(log(1/eps)), again at the cost of O(log(1/eps)) additional full gradient computations.

The algorithms have been analyzed and implemented.
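To make the variance reduction scheme described above concrete, here is a minimal sketch of the variance-reduced gradient estimate inside a Frank-Wolfe step (illustrative only, not the authors' exact SVRF/STORC pseudocode; the callables grad_fi, full_grad, and lmo, the 2/(k+2) step size, and the epoch/inner-loop lengths are assumptions):

    import numpy as np

    def svrf_sketch(grad_fi, full_grad, lmo, x0, n, epochs=10, inner=50):
        """Minimal sketch of a variance-reduced stochastic Frank-Wolfe loop.

        grad_fi(i, x): gradient of the i-th component function at x
        full_grad(x):  full gradient at x (computed once per snapshot)
        lmo(g):        linear minimization oracle, argmin_{v in C} <g, v>
        """
        x = x0
        for s in range(epochs):
            snapshot = x                  # snapshot point, updated in the outer loop
            mu = full_grad(snapshot)      # one full gradient per snapshot
            for k in range(inner):
                i = np.random.randint(n)  # sample one component function
                # variance-reduced estimate: sampled gradient "shrunk" toward
                # the full gradient computed at the snapshot point
                g = grad_fi(i, x) - grad_fi(i, snapshot) + mu
                v = lmo(g)                # Frank-Wolfe vertex
                gamma = 2.0 / (k + 2)     # placeholder step size, not the paper's schedule
                x = (1.0 - gamma) * x + gamma * v
        return x

The sketch omits details (e.g., mini-batch sizes and the exact epoch schedule) that the actual algorithms and their analysis depend on.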
In the experimental section of the paper, the implementations are compared to their non-variance-reduced counterparts and also to an SGD variant and its variance-reduced counterpart. The experiments have been conducted on a multi-class classification task with a logistic loss function. It seems that the variance-reduced versions perform well in these experiments.

Clarity - Justification:
The paper is solid but not particularly well written.
1. Already the statement of the research question (lines 73-75) could be stated more clearly.
2. Line 180: it should be enough to compute the top singular vector [so one 's' can be removed].
3. Line 470: "Our second algorithm apply" ??
4. Typesetting in general: for me it is difficult to read chains of inequalities that go over line breaks, and large formulas in the main text (i.e., not in an equation environment).
5. The description of the experiments is incomplete: for the STORC implementation, how do you choose beta_k and eta_{t,k}? Choosing these constants is critically important for the algorithm.
6. Table 2: Lipchitz -> Lipschitz.
7. The G-Lipschitz condition seems fairly strong (though it has been used before).

Significance - Justification:
Again, this seems to be a solid contribution. Like many submissions it is rather incremental: the novel idea of variance reduction is transferred here to the Frank-Wolfe setting, which is a valid contribution and technically challenging in parts. The second and technically more challenging variant (STORC) uses ideas from Nesterov's acceleration and thus suffers from the same drawbacks, namely that the algorithm (and not only its analysis) needs to know problem-specific constants (here: Lipschitz constants and diameters of convex sets). STORC does not perform so well in the experiments, and it is not described how these parameters are set in the experiments. I would be very interested to see an experimental comparison to standard techniques (without gradient sampling), i.e., some nuclear-norm-regularized logistic loss minimization. There is a plenitude of algorithms for this problem that should be quite efficient on data sets of the size considered here --- for larger data sets that can be different, of course.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I checked the correctness of the proof of Theorem 1, which follows standard arguments in the Frank-Wolfe setting. I did not check the much more involved proof of Theorem 2 (which seems to use similar ideas, but in a more complicated setting).

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes two variants of Frank-Wolfe that leverage the recent variance reduction technique to address problems whose objective is a finite sum of smooth functions and whose domain is accessed through a linear minimization oracle. One algorithm combines stochastic Frank-Wolfe with variance-reduced gradients; the other combines SVRG with conditional gradient sliding. Both algorithms improve upon the complexity in terms of stochastic gradients and demonstrate better performance numerically.
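As a concrete illustration of the linear minimization oracle mentioned above (and of Review #2's remark that only the top singular vector is needed), here is a minimal sketch of the oracle over a nuclear-norm ball; the radius tau and the use of SciPy's truncated SVD are illustrative assumptions, not the paper's exact experimental setup:

    import numpy as np
    from scipy.sparse.linalg import svds

    def lmo_nuclear_ball(G, tau):
        """Linear minimization oracle over {X : ||X||_* <= tau}.

        Solves argmin_{||V||_* <= tau} <G, V>; the minimizer is
        -tau * u1 v1^T, where (u1, v1) is the top singular vector pair
        of G, so only one (approximate) top singular pair is required.
        """
        u, _, vt = svds(np.asarray(G, dtype=float), k=1)
        return -tau * np.outer(u[:, 0], vt[0, :])

This is the kind of oracle the Frank-Wolfe variants call once per iteration, which is cheaper than a projection onto the nuclear-norm ball (a full SVD).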
Clarity - Justification:
The paper is well written. The algorithmic ideas, the theoretical analysis, and especially the comparisons to previous work are clearly presented. The proofs seem correct.

Significance - Justification:
This work gives a very comprehensive study of stochastic Frank-Wolfe algorithms. The algorithmic ideas are straightforward but novel. The theoretical results fill the gaps in this line of research and are meaningful. Overall, the paper makes a very strong contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
One minor comment: in the numerical experiments, the authors implement a modified SVRF without restarts, and the results show that this version of SVRF performs much better than STORC, while the theory suggests the opposite. Perhaps the authors could extend the analysis to this modified SVRF and provide a detailed explanation rather than leaving a mystery here. Also, SFW seems to perform quite comparably to SVRF and much better than the "theoretically best" STORC, so it would be good if the authors could provide some insight into this (whether it is because the batch size or inner steps are not tuned properly for STORC, or due to some other reason).

=====