We thank the reviewers for their helpful feedback and for catching the typos. We address the individual points below.

REV_5
- Variance of relative error: By employing the work of (Pemantle & Peres, 2013) we can obtain bounds that hold with high probability; DPPs enjoy strong concentration results in general. We will add this analysis and explain it.
- Partial Adaptive Sampling: We are happy to include this baseline.
- Time-Accuracy Tradeoffs: In this set of experiments we initialize the Markov chain DPP with Kmeans++. This initialization acts as a warm start for the Markov chain and improves performance in the first several iterations.
- Experiment scales: We will add experiments on larger datasets to the supplementary material; the trends in the results are the same.

REV_6
- Figures: Both Figs. 5 and 6 are for the Ailerons dataset, as are Figs. 1 and 3. We will add explicit descriptions to the figure captions.
- Line 485: Yes, here k should be the number of columns c to be sampled. We will make this consistent.
- Comparison with previous results: We thank the reviewer for pointing out the Kmeans Nystrom method. The main reason we did not compare with it is that we consider only the standard setting where columns are selected from the kernel matrix; Kmeans Nystrom selects out-of-sample columns, which could give it an unfair advantage. However, we are happy to include the corresponding empirical comparisons.
- Comments on bounds: Although we are not sure at this point, we believe both bounds may be improved, possibly via a more precise bounding of the ratio of elementary symmetric polynomials. We will comment on the tightness of the bounds in the paper.

REV_7
Special thanks to the reviewer for the very detailed comments and suggestions.
- Eq. 3.1: Indeed, although our current notation looks simpler, it is ambiguous and can cause confusion. We will make it more explicit, as suggested by the reviewer.
- Line 242: The ambiguity here is caused by the same issue as above. U_C and V_C are not related to U and V, i.e., they are not subsets of rows of U and V. Rather, they are new matrices (we used C as a subscript to indicate a new symbol, but this happens to collide with the previously ambiguous notation). To avoid ambiguity in this rebuttal, and for purposes of illustration, we write U', \Sigma' and V' for the SVD factors of B_{.,C} (not of B_{.,C}^T). Line 3 of the equation in question then becomes Line 4 by substituting B_{.,C} = U'\Sigma'(V')^T into the expression and cancelling the redundant terms (a generic sketch of this cancellation is given below). We will switch to a clearer set of notation in the paper.
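As an illustration of the kind of cancellation we mean (a generic sketch in the rebuttal notation above, not the paper's exact Line 3 to Line 4 step), assume B_{.,C} has full column rank c, so that \Sigma' is invertible and V' is a c x c orthogonal matrix. Then
\[
B_{.,C}\bigl(B_{.,C}^T B_{.,C}\bigr)^{-1} B_{.,C}^T
= U'\Sigma'(V')^T \bigl(V'(\Sigma')^2 (V')^T\bigr)^{-1} V'\Sigma'(U')^T
= U'(U')^T,
\]
since (V')^T V' = I and the \Sigma' factors cancel, leaving the orthogonal projection onto the column space of B_{.,C}.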
- Line 243: U'^{\perp} (which is U_C^{\perp} in the paper) is an r(K) x (r(K)-c) matrix, where r(K) is the rank of K, since U' is obtained from the SVD of B_{.,C}, the r(K) x c submatrix of the r(K) x N matrix B formed by the columns indexed by C. We will clarify this and update the corresponding notation.
- Line 259 & 264: Thanks for pointing this out. We will add more explanation here to make it clearer.
- About \alpha: We fully agree -- when using the path coupling lemma, this is probably the best we can get. For kernels, \alpha depends nontrivially on the bandwidth of the kernel. We tested \alpha on some of our datasets and will extend that discussion. We also observed (paragraph of line 767) that even with some \alpha > 1 the chain seems to mix fast, indicating that a large \alpha does not necessarily result in slow mixing and suggesting that our analysis is conservative.
- Comparison with Kmeans++: We observe that Kmeans++ is itself a decent method for Nystrom and KRR, and better than uniform sampling, but it can suffer from larger variance and higher average error than DPP-Nystrom.
- AdapFull sampling: AdapFull sampling is closely related to kDPP sampling. kDPP sampling draws subsets with probabilities proportional to the corresponding volumes, whereas AdapFull is a non-randomized greedy method that, in each iteration, picks the column with the largest gain in volume (a minimal sketch of this greedy step is given at the end of this response). In general, however, no good approximation bounds are known for such greedy methods. We will add a discussion of AdapFull to the paper.
- Comparisons with Uniform Sampling: Regarding Section A.2, Fig. 8(a), a possible explanation is that the matrix's effective rank is smaller than c, i.e., only r < c eigenvalues are large and the remaining ones are very small. By nature, the DPP assigns low probability to any low-rank set of columns, since the associated determinant (the volume spanned by these columns) is small. Here, due to the low effective rank, the determinant of any c x c submatrix is small, and the DPP can hardly distinguish between a set of columns of effective rank r and one of effective rank much less than r. An r-DPP would be better here. This may also be a result of numerical instability caused by the fast decay of the eigenspectrum, as observed in (Anderson et al. 2015).
- Figures: We will update the figures to make the presentation clearer.
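For concreteness, the greedy volume step referenced in the AdapFull discussion above can be sketched as follows; this is only an illustration in our own notation (the function name greedy_volume_selection and the RBF toy example are ours, not the exact AdapFull procedure).

```python
# Illustrative sketch: greedily pick, at each iteration, the column whose
# inclusion maximizes the volume, i.e. the determinant of the selected
# principal submatrix of the PSD kernel matrix K.
import numpy as np

def greedy_volume_selection(K, n_cols):
    """Return indices S (|S| = n_cols) chosen greedily to maximize det(K[S, S])."""
    n = K.shape[0]
    selected = []
    for _ in range(n_cols):
        best_j, best_logdet = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            S = selected + [j]
            sign, logdet = np.linalg.slogdet(K[np.ix_(S, S)])
            if sign > 0 and logdet > best_logdet:
                best_j, best_logdet = j, logdet
        if best_j is None:
            break  # all candidate submatrices are numerically singular
        selected.append(best_j)
    return selected

# Toy usage: RBF kernel on random 2-D points, select 5 landmark columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sqdist / 2.0)
print(greedy_volume_selection(K, 5))
```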