We are thankful to all reviewers for their careful and constructive comments.

To Reviewer_2:

1) On the comment "...mistake of putting \alpha_k <= 0 as the constraint...": The constraint \alpha_k <= 0 is correct. The same formulation of multiclass SVM is used in the LIBLINEAR solver [1] (Appendix E, Eq 13).
[1] R.-E. Fan et al. "LIBLINEAR: A Library for Large Linear Classification." JMLR 2008.

2) On the comparison to the more recent extreme multilabel solver SLEEC [3]:
[3] K. Bhatia et al. "Sparse Local Embeddings for Extreme Multi-label Classification." NIPS 2015.
The authors of [3] released their code to us close to the submission deadline, so we could not include its results by then. We now have SLEEC's results:

multilabel dataset | training time (SLEEC / ours) | test accuracy (SLEEC / ours)
rcv1_regions | 1129s / 8.8s | 91.0% / 96.5%
Eur-Lex | 2443s / 435s | 74.2% / 76.3%
bibtex | 298s / 5s | 65.1% / 64.6%

multiclass dataset | training time (SLEEC / ours) | test accuracy (SLEEC / ours)
LSHTC1 | 10793s / 952s | 12.8% / 22.7%
aloi.bin | 12200s / 774s | 92.6% / 96.3%
sector | 164s / 14s | 87.6% / 95.3%

These results show that (i) SLEEC's training time is higher than ours by an order of magnitude (training time was not reported in [3]), and (ii) SLEEC reaches reasonable accuracy on the multilabel datasets but not on the multiclass ones, which is expected: SLEEC finds neighbors of each label via label co-occurrence [3], and in multiclass problems there is no label co-occurrence.

3) How was accuracy measured for the multilabel datasets? The measure in the multilabel experiments is top-1 accuracy (precision). We will make this clear in the next version.

4) What is LMO (line 502)? It stands for Linear Minimization Oracle. We will make this clear in the next version.

5) On the question of "tolerance of convergence bound to all the approximations": (i) Hessian approximation (Eq 21): the convergence analysis in Theorem 1 is based on the same quadratic upper bound (with constant Q_i), so Theorem 1 holds for the algorithm without bi-stochastic greedy selection. (ii) Bi-stochastic greedy selection gives a 1/\nu multiplicative approximation and an \epsilon_p/\nu additive error to the expected descent amount; this is reflected in Theorem 2, where the rate is decreased by a factor of 1/\nu and the bound holds only for iterations t \leq 2NQ_{max}\nu/\epsilon_p.

6) On the question Q ~ O(N): Q should be replaced by \bar{Q} = \frac{1}{N}\sum_{i=1}^N Q_i. Since our objective is not scaled by 1/N as in [4], their result is adapted by replacing G(\alpha) with G(\alpha)/N and Q_i with Q_i/N^2, which amounts to replacing Q by \bar{Q}, with the convergence phase beginning when t ~ N. We will correct this and include a detailed, self-contained proof in the next version.
[4] S. Lacoste-Julien et al. "Block-Coordinate Frank-Wolfe Optimization for Structural SVMs." ICML 2013.

To Reviewer_3:

1) On the technical contribution: To our knowledge, this paper is the first to show that the search for active dual variables in Block-Coordinate Frank-Wolfe can be done efficiently by exploiting the sparse primal iterate, while the maintenance of nonzero primal variables can be done cheaply through the sparsity of the dual iterates, resulting in a complexity of O(nnz(W_t)N + nnz(A_t)D), sublinear in the number of primal and dual variables. The simplicity of the algorithm is an advantage for practitioners. It also shows that the right combination of loss, regularizer, and optimization algorithm gives surprisingly strong performance compared with state-of-the-art approaches such as FastXML, LEML, SLEEC, and VW-Tree.
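For intuition, the following minimal Python sketch (our own illustration, not the actual implementation; all names and data structures are hypothetical simplifications) shows the two kinds of sparse bookkeeping behind this complexity: scoring a sample touches only the nonzeros of W_t supported on that sample's features, and a sparse dual step updates only the corresponding rows of W_t.

# Minimal sketch (illustrative only), assuming W_t is stored as sparse rows
# indexed by feature, and x_i and the dual step are stored as sparse dicts.
from collections import defaultdict

def score_active_labels(x_i, W_rows):
    """Score only labels sharing a nonzero feature with sample x_i.
    x_i: {feature_j: value}; W_rows: {feature_j: {label_k: weight}}."""
    scores = defaultdict(float)
    for j, v in x_i.items():
        for k, w_jk in W_rows.get(j, {}).items():
            scores[k] += v * w_jk          # touches only nonzeros of W_t
    return scores

def update_primal(W_rows, x_i, delta_alpha_i):
    """Propagate a sparse dual step (few nonzero delta_alpha_{ik}) to W_t.
    Cost ~ nnz(x_i) * nnz(delta_alpha_i)."""
    for j, v in x_i.items():
        row = W_rows.setdefault(j, {})
        for k, d in delta_alpha_i.items():
            row[k] = row.get(k, 0.0) + v * d

Summing the scoring loop over all samples (at most N samples per feature) and the update loop over all dual steps (at most D features per sample) recovers the O(nnz(W_t)N + nnz(A_t)D) cost stated above.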
2) On nnz(W_t) and nnz(A_t): Although nnz(W_t) \leq nnz(A_t) does not hold for all iterates, nnz(A_t) is bounded by t, the number of BCFW iterations, and therefore, even for \lambda = 0, nnz(W_t) is bounded by |X| * nnz(A_t)/N; that is, the maximum memory footprint is bounded by the data size |X| times the average number of active labels per sample. More importantly, our experiments show that the \lambda maximizing test accuracy gives extremely sparse iterates, as shown in Table 1. For example, on LSHTC-wiki (K = 320338), we have nnz(A_t)/N = 18.24 and nnz(W_t)/D = 20.95.

3) On the notation \alpha_i and \alpha_k: We use \boldsymbol{\alpha}_i to denote the length-K vector [\alpha_{i1}; ...; \alpha_{iK}] and \boldsymbol{\alpha}_k to denote the length-N vector [\alpha_{1k}; ...; \alpha_{Nk}].

4) On the motivation for the Hessian approximation and the reason for sparsity after projection: We minimize the dual objective with respect to the active set only, so any variable not in the set remains 0, which results in a sparse update. A diagonal matrix is used to upper bound the Hessian so that the active-set subproblem has a closed-form solution (a schematic version is sketched at the end of this response). It has been shown that fully-corrective Frank-Wolfe, which maintains an active set, converges much faster than vanilla Frank-Wolfe.

To Reviewer_1: We appreciate all the detailed suggestions and will revise our paper accordingly in the next version.
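Sketch referenced in point 4 to Reviewer_3 (a simplified schematic, not the exact form of Eq 21; here C_i denotes the dual feasible region of block i, A_i its active set, and we assume the Hessian bound is a scaled identity Q_i I): the active-set subproblem is

min_{\Delta} \nabla_i G(\alpha)^T \Delta + \frac{Q_i}{2} \|\Delta\|^2  s.t.  \alpha_i + \Delta \in C_i,  \Delta_k = 0 for all k \notin A_i,

whose minimizer is the Euclidean projection of \alpha_i - \nabla_i G(\alpha)/Q_i onto C_i with the coordinates outside A_i held fixed at 0. For simplex- or box-type C_i this projection can be computed by sorting in O(|A_i| log |A_i|), and every coordinate outside A_i remains exactly 0, which is the source of the sparse update.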