We sincerely thank the reviewers for their time and effort in reviewing our manuscript! Since two of the reviewers are largely satisfied with it (apart from minor issues that we are happy to address in the next version), we respond below only to the major concerns raised by Reviewer_5.

Q: "If only one Li is nonzero, the sqrt(n) speedup cannot be reached"

A: While this statement is true, it describes an uninteresting degenerate case. If the smoothness parameters for x_2, x_3, ..., x_n are all zero, then the convex function f must be independent of those n-1 variables, so f(x) reduces to a univariate function f(x_1). In this setting, all known accelerated methods (full-gradient or coordinate-gradient) perform identically, because there is no need to sample coordinates at all; even ACDM/APCG are unnecessary here.

Q: "the results seem unsurprising (randomization improves convergence, accelerated descent has faster convergence)"

A: While such intuition may seem natural to a non-expert, obtaining our result is highly non-trivial and, we would argue, surprising!

- First, our result should be a surprise to optimization experts. The existing methods APCG and ACDM already combine randomization with acceleration, yet they obtain only suboptimal running times, and some of their original authors did not know whether an improvement was possible (private correspondence). The obstacle was not only technical difficulty (e.g., our paper uses a different analysis framework from the existing literature) but also the choice of sampling distribution. Focusing on the Euclidean norm, for instance, existing sampling distributions are either uniform (APCG) or proportional to L_i (ACDM). We propose to sample coordinate i with probability proportional to sqrt(L_i), which had not been observed before (a small illustrative sketch is appended at the end of this response).

- Second, our result should be a surprise to ERM experts as well. As discussed in the paper, [Zhao and Zhang, ICML'15] demonstrated that sampling proportional to the squared norms of the feature vectors can speed up algorithms such as SDCA, and they stated that the same should hold for accelerated methods via the Catalyst reduction. Our submitted manuscript directly challenges this conclusion by proving that sampling proportional to the norm, rather than the squared norm, is a strictly better choice for ERM (see the second sketch appended below). This should be a surprise at least to the readers of [Zhao and Zhang].

Q: what does "RCDM with [different probabilities]" mean

A: We shall make this clear in the next version of the paper: RCDM defines a spectrum of algorithms parameterized by beta. Each choice of beta corresponds to a different sampling distribution and yields a convergence guarantee with respect to the L_beta norm (see the formula at the end of this response).

Q: the presentation of the paper

A: Thank you for your detailed suggestions on improving the presentation of this paper. Indeed, due to the space limit, we referred to some theorems/figures in the appendix, which makes the first 8 pages not fully self-contained. We will at least include Algorithm 2 in the main body, and we will do our best to rearrange the rest of the paper in the revision.
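
To make the distinction between the sampling distributions concrete, here is a minimal Python sketch (illustrative only, not code from the paper; the function name and example values are our own) of the three schemes discussed above, given the per-coordinate smoothness parameters L_1, ..., L_n:

```python
import numpy as np

def sampling_probabilities(L, scheme="sqrt"):
    """Coordinate-sampling probabilities for smoothness parameters L.

    scheme = "uniform" : p_i = 1/n                      (as in APCG)
    scheme = "linear"  : p_i proportional to L_i        (as in ACDM)
    scheme = "sqrt"    : p_i proportional to sqrt(L_i)  (the proposal here)
    """
    L = np.asarray(L, dtype=float)
    if scheme == "uniform":
        w = np.ones_like(L)
    elif scheme == "linear":
        w = L
    elif scheme == "sqrt":
        w = np.sqrt(L)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w / w.sum()

# Example with highly non-uniform smoothness: sqrt-sampling sits strictly
# between uniform sampling and sampling proportional to L_i.
L = np.array([100.0, 1.0, 1.0, 1.0])
for scheme in ("uniform", "linear", "sqrt"):
    print(scheme, sampling_probabilities(L, scheme))

# Drawing the coordinate to update at one iteration:
p = sampling_probabilities(L, "sqrt")
i = np.random.default_rng(0).choice(len(L), p=p)
```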
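
Similarly, for the ERM point, the following sketch (again only illustrative; the helper and the random example data are hypothetical, not from the manuscript) contrasts the squared-norm importance sampling of [Zhao and Zhang] with the norm-proportional sampling we advocate, over the feature vectors a_1, ..., a_n stored as rows of A:

```python
import numpy as np

def erm_sampling_probabilities(A, power=1.0):
    """Importance-sampling probabilities over the rows (feature vectors) of A.

    power = 2.0 : p_i proportional to ||a_i||^2  (the [Zhao and Zhang] choice)
    power = 1.0 : p_i proportional to ||a_i||    (the choice advocated here)
    """
    norms = np.linalg.norm(A, axis=1)
    w = norms ** power
    return w / w.sum()

# Example: one feature vector with a much larger norm than the rest.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
A[0] *= 10.0
print("p ~ ||a_i||^2:", erm_sampling_probabilities(A, power=2.0))
print("p ~ ||a_i||  :", erm_sampling_probabilities(A, power=1.0))
```

Norm-proportional sampling concentrates less probability mass on the largest feature vector than squared-norm sampling does, which is the quantitative difference our analysis exploits.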
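
Finally, for the RCDM question, the spectrum we have in mind can be rendered schematically as follows (the exact pairing of exponents follows the convention in the manuscript; in Nesterov's original notation the sampling exponent and the norm exponent typically differ, e.g. alpha and 1-alpha):

\[
\Pr[i_k = i] \;=\; \frac{L_i^{\beta}}{\sum_{j=1}^{n} L_j^{\beta}},
\qquad
\|x\|_{L_\beta}^2 \;=\; \sum_{i=1}^{n} L_i^{\beta}\, x_i^2 ,
\]

so that each value of beta specifies both the sampling distribution and the norm in which the convergence guarantee is stated.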