We thank the reviewers for their valuable feedback and comments, which will certainly help us polish the paper.
Reviewer_1:
In theory, the modified SVRF (without restart) provides only a constant speedup, as we point out at the end of Section 3. However, it is perhaps not surprising that Nesterov's acceleration technique provides better theoretical guarantees but not always better performance in practice, since it is very sensitive to parameter tuning. This has been observed in previous work as well, e.g., "On the importance of initialization and momentum in deep learning" by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton.


Reviewer_3:
1. About notation: it is quite common to drop indices in pseudocode for a more concise presentation. For rigor, the proofs always state explicitly that we first consider a fixed iteration t (e.g., Line 416 and Line 561).

2. In the proof of Lemma 3, D_t \leq D indeed does not hold for case (c), but we DID NOT use this fact to prove the bound on E[|y_0-w^*|^2]. Please see Lines 584-585, where we prove the bound (note that the last step is an exact equality by the definition of D_t).

3. Line 697: Actually, s is not 2 here; it can be any index between 1 and N_t (since it corresponds to the index k in Line 7 of Algorithm 2), which is also why (s+1)^2 \leq (N_t+1)^2.

4. Line 695: You are right that we also need a bound on \E[f(y_1) - f(w_*)], but we do prove this at Line 597, which is the base case of the induction. We will make this point clearer in the final version. Thank you for pointing out the source of confusion.


Reviewer_4:
General remark: we would like to point out that variance reduction has not been studied in the context of projection-free methods before. Projection-free methods are very different from gradient-descent variants, since they replace the fundamental computational primitive, projection, with linear optimization. Hence, they are effective for very different kinds of problems (such as those we experimented on).

1. Lines 73-75: here we try to state the main question in one concise sentence to highlight the key topic of the work, and we explain it further in detail in Section 2. Perhaps "how fast" can be replaced by "what running time", but apart from this, could you please point out more specifically which part of this sentence is unclear?

2. Writing proofs in a two-column format is challenging in general, but we will do our best to improve the readability of the proofs in the final version.

3. STORC implementation: we set beta_k and eta_{t,k} according to the theory (Line 527), which we briefly mention in Lines 736-737. We will make these details clearer in the final version.

4. "G-Lipschitz condition seems fairly strong": this is exactly why we avoid this assumption in most of our analysis (except case (b) of Theorem 2), whereas most previous work DOES make this assumption.