We thank the reviewers for their valuable comments, which will improve the clarity of the paper.

== Significance of the contributions ==

Note that the ideas developed in this paper are not limited to BCFW for structured SVMs, but could also be extended to other optimization problems where decomposable gaps are available (e.g. SDCA, where they would yield a different algorithm from that of Csiba et al. (2015)). Moreover, BCFW is a generic optimization algorithm that has been applied to applications beyond structured SVMs (e.g. video co-localization [1], sequence alignment [2]). We will emphasize this extended scope in the revision.

*Adaptivity*
A central contribution of our work is the use of gaps to obtain *adaptive* algorithms, in contrast to just using (fixed) Lipschitz constants as is standard in the optimization literature. The gaps (computed for free in BCFW) play a crucial role in three of our contributions: non-uniform sampling, the adaptive criterion for using the cache, and the adaptive choice of breakpoints for the regularization path. We also presented the away-step FW variant because of its natural link with caching. Our work is one of the first to provide *data-adaptive sampling schemes* with proven guarantees (the other being the independently developed scheme of Csiba et al. (2015), which cannot easily be extended to the BCFW setup, whereas ours can be extended to the SDCA setting). Finally, our method is the first to approximate the *full regularization path* for structured SVMs, a significantly harder task than computing standard regularization paths for binary classification or regression.

== Gap sampling ==

The interpretation of Th 2 is indeed missing and will be clarified in the revision.

*R2, R4: Does the gap sampling bound (Th 2) improve over the uniform sampling one (Th 1)?
Yes, it does improve in the regime where the gaps are non-uniform enough (eta big enough in Th 2; an assumption empirically verified in additional experiments). Moreover, our example in Appendix A illustrates how gap sampling can be up to n times faster than uniform or even Lipschitz sampling (we will provide an explicit example of this for structured SVMs in the revision). In detail: simplifying the expression in Th 2, the factor in front of k is (eta C^otimes)/(n C^max), which needs to be bigger than 1 to improve over uniform sampling. Since n C^max >= C^otimes >= C^max, the condition eta >= n is sufficient for gap sampling to always improve over uniform sampling (in the best case, eta = n^2). We can also refine our analysis to show that gap sampling dominates with no condition on eta when the curvature constants C^(i) are uniform. We note that to design a sampling scheme that always dominates uniform sampling, we would need to include the C^(i)'s in the sampling scheme (as was essentially done in Csiba et al. (2015) for SDCA). Unfortunately, computing good estimates of the C^(i)'s is too expensive for structured SVMs, thus motivating our simpler yet practically efficient scheme.

*R4: Th 2 only applies to exact gaps.
This is true, which is why we have experimentally assessed the applicability of approximate gap sampling, notably the effect of the staleness of the gaps in Fig 1.
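For concreteness, a minimal sketch of the sampling step (hypothetical names, not the paper's exact pseudo-code): blocks are drawn with probability proportional to their (possibly stale) block gaps, which BCFW produces as a by-product of each step.

```python
import numpy as np

def sample_block_by_gap(block_gaps, rng):
    """Draw a block index i with probability p_i = g_i / sum_j g_j,
    where g_i is the (possibly stale) duality gap of block i,
    obtained for free during the BCFW step on that block."""
    total = block_gaps.sum()
    if total <= 0.0:  # all block gaps are zero: the iterate is optimal
        return None
    return rng.choice(len(block_gaps), p=block_gaps / total)

# Hypothetical usage: one "hard" block dominates the sampling.
rng = np.random.default_rng(0)
gaps = np.array([0.50, 0.01, 0.01, 0.01])
i = sample_block_by_gap(gaps, rng)  # block 0 is drawn with probability ~0.94
```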
== Pairwise steps ==

*R2, R4: Missing convergence analysis for pairwise/away steps.
The standard convergence analysis for pairwise and away-step FW cannot easily be extended to BCFW, and thus we omitted its discussion due to space constraints, focusing instead on the empirical performance. We can show a geometric decrease of the objective only when *no* block would take a drop step (aka a 'bad step'), a condition that cannot easily be analyzed due to the randomization of the algorithm. We believe that genuinely new proof techniques are required here. Please note that the arXiv paper mentioned by R2 has a fundamentally incorrect proof (it misses the above argument). Specifically, on p.16, the authors split the possibilities into cases A-D (good vs. bad steps), forgetting that this implicit conditioning introduces a bias when taking the expectation over i, a bias that was (incorrectly) not accounted for in their subsequent derivations.

== Caching ==

*R4: Th 3 and comparison to Shah et al. (2015)
We agree with R4 that Th 3 is provided simply as a sanity check. In contrast to Shah et al., which either calls the oracle on *all samples* or uses the cache, in a sequential fashion, our method adaptively chooses *for each block* whether to call the oracle, based on the block gap information (see the sketch after the references). In the extreme case where just one block is hard to optimize and all the others are easy, our method would call the oracle only on the hard block and use the cache everywhere else, yielding n times fewer oracle calls than the method of Shah et al.

New references:
[1] Joulin et al., Efficient Image and Video Co-localization with Frank-Wolfe Algorithm, ECCV'14
[2] Garreau et al., Metric Learning for Temporal Sequence Alignment, NIPS'14
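To illustrate the adaptive cache criterion discussed under Caching above, here is a minimal sketch under assumed names (not the paper's exact rule): the expensive max oracle for block i is skipped whenever the best cached corner already accounts for a large enough fraction of that block's gap estimate.

```python
def pick_corner(i, cache, gap_estimates, exact_oracle, cache_factor=0.75):
    """Per-block adaptive cache criterion (sketch, hypothetical names).

    `cache[i]` holds previously returned corners for block i, each with a
    cheap gap estimate; `exact_oracle(i)` is the expensive max oracle
    (loss-augmented decoding for structured SVMs), returning a
    (corner, gap) pair.
    """
    cached = cache.get(i, [])
    if cached:
        corner, cached_gap = max(cached, key=lambda cg: cg[1])  # best cached corner
        if cached_gap >= cache_factor * gap_estimates[i]:
            return corner  # cache hit: no oracle call for this block
    corner, gap = exact_oracle(i)  # expensive oracle call
    cache.setdefault(i, []).append((corner, gap))
    gap_estimates[i] = gap  # refresh this block's gap estimate
    return corner
```

Under such a rule, the oracle budget concentrates on the blocks whose gaps remain large, which is what yields the n-fold saving in the one-hard-block scenario described above.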