We thank the reviewers for their helpful feedback. We will improve our presentation and include the missing references in the related work. Detailed responses are given below.

META-REV / REV 6

We apologize for the lack of clarity. The results in the paper are correct but need more explanation. Concretely:

>> "bias due to heterogeneous blocks"
This is a valid and indeed critical point; we thank the reviewer for raising it. As the reviewer notes, uniformity of the coordinate blocks breaks for Alg 1 when different blocks have different time complexity. What we analyze in the paper is an oracle-based model, in which the server receives possibly delayed updates that are i.i.d. uniformly distributed across all blocks. We will make this setting clearer in the paper. In particular, the oracle assumption holds for Alg 1 when the time spent computing and communicating updates for each block depends only on the workers, not on the blocks; that is, this setup handles heterogeneous workers but not heterogeneous blocks. The oracle assumption also holds for heterogeneous blocks if the server pre-randomizes the coordinate subset S and the workers randomly select within S until the server has received everything (though in this setting no delay is allowed). This pre-randomized setting is useful when a large |S| is advantageous, and it is robust to heterogeneous workers when |S| is several times larger than \tau. Prop. 1 in Appendix D1 provides a computational result on this. We argue that our oracle model is also useful in practice. In several problems where BCFW excels (and where it makes sense to parallelize), the constraints are of the same form. Problems might arise when the partial gradients have different sparsity levels across blocks; in that case an additional procedure is required to ensure uniform time complexity across blocks (its cost depends on the maximum sparsity level).

>> "Line 1566, proof of Lemma 5"
E_{S_{-\chi_j}, ..., S_{-1} | \chi_j} Y takes the expectation over the randomness of uniformly selecting coordinates in the past \chi_j iterations, conditioned on the delay \chi_j. The total number of i.i.d. uniform samples from [n] is \chi_j \tau. \chi_j does not depend on j under our oracle assumption. We will improve the readability of the proof.

>> "\varkappa depends on j and k"
\varkappa is a random draw from the distribution of delays associated with the updates received by the server; it does not depend on j due to our oracle assumption, but it does depend on k. Indeed, \varkappa in the k-th iteration depends on everything that happened in iterations 1:k-1, and our definition of \varkappa marginalizes over whatever happened before iteration k. Thm 4 then holds for each \varkappa separately. We will revise Thm 4's statement to make this point clearer, and we will include a corollary stating the final convergence rate, which depends on max_k \delta_k.

REV 1

>> "line 690", "Fig 1"
Sorry for the confusion. Fig 1 is about the effect of minibatching without delay. "Large minibatch size introduces error" is a typo; we will fix it.

>> "no line search variants", "no Step 2b and 5 in Alg 1"
Line search is not allowed with delayed updates, but it is allowed for other forms of approximate subroutines. We will make this clearer.

REV 4

>> "sample splitting" or "feature splitting"
Our theory requires that the subproblem blocks are uniformly sampled. Once that holds, we can split across features or across data depending on the problem. While feature splitting is natural, data splitting can be done when we solve over the dual parameters, as in structural SVM.
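To make the uniform-sampling requirement concrete, here is a minimal sketch (illustrative only, not the implementation used in the paper; Worker, solve_subproblem, and run are hypothetical names): the server draws block indices i.i.d. uniformly, independently of which worker is free, and a block may index either a group of features or the dual variables of one sample.

```python
# A minimal sketch, not the paper's implementation: the only requirement our
# theory places on the splitting is that subproblem blocks are sampled
# i.i.d. uniformly.  A "block" may index a group of features (feature
# splitting) or the dual variables of one training sample (data splitting,
# as in structural SVM).  All names below are illustrative.
import random

class Worker:
    """Stand-in for a worker that solves one block subproblem."""
    def __init__(self, name):
        self.name = name

    def solve_subproblem(self, block):
        # In BCFW this would call the linear-minimization oracle for `block`.
        return (self.name, block)

def run(n_blocks, workers, n_rounds, rng=random):
    """Server loop: block indices are drawn uniformly, independently of which
    worker happens to be free, matching the oracle model described above."""
    updates = []
    for _ in range(n_rounds):
        worker = rng.choice(workers)      # heterogeneous workers are fine
        block = rng.randrange(n_blocks)   # uniform over all blocks
        updates.append(worker.solve_subproblem(block))
    return updates

# Example: 8 blocks, 3 workers, 20 sampled subproblems.
print(run(8, [Worker(i) for i in range(3)], 20))
```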
>> "overall computation/communication cost"
Overall computation is measured by the number of partial gradient evaluations and subroutine calls, as in Thms 1/2. The subproblem in structural SVM for sequence labeling is solved with the Viterbi algorithm, which has linear complexity; the partial gradients are not explicitly computed. Overall communication includes the cost of broadcasting parameter updates and the results from the workers, both of which are sparse. We will add a more careful discussion of this.

>> experiments
We will add the baselines. We have shown that our algorithm has smaller wall-clock time than BCFW, which in turn was shown to outperform other well-known methods in Lacoste-Julien et al. (2013). We will also report results on another dataset about 600 times larger than the current one in terms of #samples x #features.

REV 5

Thanks so much for the detailed review and pointers.

>> L323 "why additive"
It adds to the difference in expectation; please see the additive/multiplicative error model in Jaggi (2013).

>> A more detailed proof of L615
We will add a lemma. D^{\varkappa\tau} is concave in \varkappa; thus, by Jensen's inequality, E[D^{\varkappa\tau}] ≤ D^{(E\varkappa)\tau}. For Euclidean norms, we can further bound the second term by \sqrt{E\varkappa} D^{\tau}.
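For concreteness, the planned lemma would make the Jensen step explicit. A sketch, writing g(\varkappa) := D^{\varkappa\tau} for the quantity defined in the paper (the lemma assumes g is concave in \varkappa, as stated above):

```latex
% Sketch of the Jensen step for the lemma to be added (assumes, as stated
% above, that g(k) := D^{k\tau} is concave in k).
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\[
  \mathbb{E}\bigl[D^{\varkappa\tau}\bigr]
  \;=\; \mathbb{E}\bigl[g(\varkappa)\bigr]
  \;\le\; g\bigl(\mathbb{E}[\varkappa]\bigr)
  \;=\; D^{(\mathbb{E}\varkappa)\tau}
  \qquad\text{(Jensen's inequality, since $g$ is concave).}
\]
\end{document}
```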