Paper ID: 964
Title: Fast DPP Sampling for Nyström with Application to Kernel Methods

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a determinantal point process (DPP) based procedure to pick columns of a matrix for Nystrom approximation. The basic idea is that the DPP-based procedure encourages a 'diverse' set of columns to be picked. The authors show that under a few assumptions, Markov-chain-based sampling leads to fast DPP selection. Extensive experiments are shown on many (small-scale) datasets.

Clarity - Justification:
The paper is well written with clear arguments, both theoretical and experimental.

Significance - Justification:
This paper provides another way of sampling columns of a kernel matrix for Nystrom approximation, which is interesting, but there exist many other ways of learning a 'diverse' set of points. Experimentally, it is an expensive process and it is not clear it will be of much practical use (similar to leverage scores).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I find the paper well executed. The motivation and arguments are crafted nicely. Showing errors both on the quality of kernel approximation and on an end task (e.g., kernel ridge regression) enhances the appeal of the work. I did not go through the proofs in detail, but the theoretical core of the work looks well grounded. I have only a few comments:

* In Thm 1, the authors provide a result on the expected relative error in kernel approximation. What about the variance? Can that be arbitrarily large?
* A very relevant work on sampling was not compared against: partial adaptive sampling (Kumar et al., ICML 2009). In comparison with full adaptive sampling, it provides a much better speed-accuracy tradeoff.
* In the experiments, the authors did not show error bars.
The results in Fig. 6 do not look convincing, as I don't understand how DPP with almost no computation time (left side of the plots) can be much better in accuracy than random sampling.
* Finally, the experiments were all conducted on very small datasets (3000 pts). Since computation time is the biggest threat to DPP-based sampling, it would be good to incorporate at least some medium-scale (50K pts) experiments.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper is concerned with the problem of selecting landmarks (columns) for the Nystrom method in the context of approximating kernel matrices. The authors analyze DPP-Nystrom, a procedure introduced by Belabbas and Wolfe (2009), which uses determinantal point process sampling to select a diverse collection of columns (where similarity is determined by the kernel matrix). In addition to providing expected relative error bounds for kernel approximation and expected relative risk bounds for kernel ridge regression, the authors suggest using a Gibbs sampler to more cheaply draw samples from the DPP and provide a bound on the mixing time of this Markov chain in terms of quantities that can be estimated from data. Experiments on eight real datasets are conducted to compare the relative matrix reconstruction error and relative risk of DPP-Nystrom (with Gibbs sampling) with several alternative approaches to selecting Nystrom landmarks.

Clarity - Justification:
The paper on the whole is very clearly written, but please note the following opportunities to improve clarity:
- Fig. 5: Which dataset was used to produce this figure?
- Fig. 6: Which dataset was used to produce this figure?
- L485: Is it correct that I should think of the parameter k discussed in this section as the number of columns c to be selected?

Significance - Justification:
- Does the paper contribute a major breakthrough or an incremental advance?
I believe this work represents a significant advance for Nystrom landmark sampling.
- Are other people (practitioners or researchers) likely to use these ideas or build on them?
Practitioners are likely to use DPP-Nystrom with Gibbs sampling, due to the demonstrated time-accuracy benefits and apparently quick chain convergence, and researchers will likely continue to sharpen the analyses.
- Does the paper address a difficult problem in a better way than previous research?
My main concern is that the authors do not compare with the K-means Nystrom algorithm of Zhang et al. (2008), which was found to be the best empirical performer among Nystrom variants in the time-vs-accuracy Nystrom benchmark paper of Kumar et al., "Sampling Methods for the Nystrom Method." In particular, that variant outperformed many of the approaches used as benchmarks here.
- Is it clear how this work differs from previous contributions, and is related work adequately referenced?
The authors claim that their bounds are not comparable to past high-probability relative error bounds, but a high-probability bound also implies a bound in expectation (by integrating the tail probabilities). Could the authors comment on how their expectation bounds compare with the expectation bounds implied by existing high-probability bounds?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Are claims well supported by theoretical analysis or experimental results?
The authors rigorously bound the expected relative approximation error and the expected relative kernel ridge regression risk of DPP-Nystrom, and provide a potentially valuable bound on the mixing time of the DPP Gibbs sampler in terms of three factors that can be estimated from data. The authors conduct a comprehensive suite of experiments demonstrating both the time and accuracy benefits of DPP-Nystrom over several alternative Nystrom approaches.
- Is this a complete piece of work, or merely a position paper?
This is a complete piece of work and not merely a position paper.
- Are the authors careful (and honest) about evaluating both the strengths and weaknesses of the work?
Since the spectral norm bound was obtained from the Frobenius norm bound, it seems likely the spectral norm bound could be improved. Perhaps the authors can comment on which bounds they believe can be tightened.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This work considers the problem of selecting a good set of "landmark" points for a Nystrom approximation. First, the paper provides a new analysis of the determinantal sampling algorithm from Belabbas and Wolfe (2009), in light of the fact that their algorithm can be viewed as determinantal point process (DPP) sampling. Second, the paper improves the runtime of the algorithm by using the Gibbs sampler from Kang (2013) to sample the DPP. Importantly, this work also provides a corrected analysis of the Gibbs sampler's convergence. (No guarantee of convergence was previously known.)

Clarity - Justification:
The paper is well written and most notation is clear. It would be nice, though, to see more explanation for certain lines in the proofs. See the "detailed comments" section for more specific issues. (I would happily raise this score to "Excellent" if the authors can address the issues from the "detailed comments" section, particularly those surrounding line 240.)

Significance - Justification:
While the basic idea of selecting landmarks proportional to volumes was previously proposed and analyzed in Belabbas and Wolfe (2009), that work did not make use of the fact that volumes are proportional to DPP probabilities. Thus, this work represents an interesting new perspective on the analysis of that algorithm. Additionally, the correction of the convergence proofs for the Gibbs sampler from Kang (2013) is non-trivial. This itself is a substantial contribution.

Detailed comments.
(Explain the basis for your ratings while providing constructive feedback.):
Line 130: "We address a Gibbs sampler" -> "We consider a Gibbs sampler".
Lines 182-183: Might want to use "K" instead of "L" here, to be consistent with the rest of the paper.
Equation 3.1: In the notation from Section 2, shouldn't the r(K) x |S| submatrix be denoted by B_{.,S} rather than just B_S? This issue occurs in many other places in the paper as well. Either define B_S as a special case or, better yet, make the notation consistent by using B_{.,S} everywhere. I understand that using just B_S looks prettier, but (for me at least) it substantially detracts from the clarity of the exposition.
Lines 235 and 276: It looks like the comma used elsewhere in the paper is missing here: K_{CC} should be K_{C,C} and K_{.C} should be K_{.,C}. These should be added for clarity.
Line 242: The paper states "U_C \Sigma_C V_C^T is the SVD of B_C". There are a couple of issues with this. First, this should be the SVD of B_C^T rather than B_C; K is defined to be K = B^T B = U \Lambda U^T, so this additional transpose is required for the U's to be on the outside. Second, the notation "U_C \Sigma_C V_C^T" is a bit misleading. As mentioned above, the matrix B_C really denotes B_{.,C}. So, if the SVD of B is V \Sigma U^T, then the SVD of B_{.,C} is V \Sigma U_{C,.}^T. In other words, there should be no "C" subscripts on V and \Sigma. Given these changes, it's not clear to me how line 239 becomes line 240 (the simplification to U_C U_C^T). It might help if the pseudoinverse expression were written out, along with a couple of other intermediate simplification steps.
Line 243: The paper states "we extend U_C to a full basis, U = [U_C U_C^{\perp}]". Is this U_C^{\perp} a c x N matrix representing the projection onto the nullspace of U_C? If so, maybe state that to clarify what "extend to a full basis" means.
Line 259: A little more explanation would be nice here.
Perhaps state the expansion of the r(K) x 1 vector b_i = V \Sigma U_{.,i}^T (or whatever it actually is). This, combined with the fact that multiplication by U_C^{\perp} is a projection onto the nullspace of U_C, should make it clear why the sum becomes restricted to i \notin C.
Line 264: Here it is stated "By (3.1) and Lemma 2 it follows that ...". But (3.1) is not used in the subsequent derivation; rather, it is used prior to this, in line 257. Instead, it might be good to say "By Lemma 2 and the means inequality 1/n \sum_{i=1}^n x_i \leq \sqrt{1/n \sum_{i=1}^n x_i^2}, it follows that ...".
One idea for possible future work: the paper states that "Our bounds are not comparable to previous bounds ... that hold with certain probability since our bounds hold in expectation". The expectation comes from the fact that the set C is sampled from the k-DPP. Perhaps better bounds could be achieved if C were selected via an algorithm that attempts to (approximately) find the mode of the DPP? Current approximation algorithms for finding the DPP's mode are too expensive, O(N^3), but perhaps some less expensive mode approximation could be derived.
I didn't go over the Markov chain mixing proof in detail, but it seems generally correct. It's too bad alpha is so complicated, but perhaps nothing better can be shown when using the path coupling lemma. Can you give a little more intuition in the remarks section about alpha? Maybe show a couple of example matrices, one for which alpha is large and one for which it is small?
In the experiments, the Gibbs sampler is initialized with K-means++. Since the plots in Figure 5 show a small relative error (at most 3%) even at iteration 0, this seems to imply that K-means++ is itself a decent method for choosing the set of landmarks. Does it compare favorably with the other methods in terms of training and testing error? It would be interesting to see it included in the error plots if so.
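For concreteness, here is a minimal sketch of the kind of swap-based Gibbs chain discussed above. This is my own illustration of a standard exchange chain for a k-DPP, not the authors' implementation: the Metropolis-style acceptance rule and the uniform initialization (standing in for the paper's K-means++ init) are assumptions.

```python
import numpy as np

def kdpp_gibbs(K, k, n_iters=2000, seed=None):
    """Swap-based Gibbs chain targeting a k-DPP with PSD kernel K.

    Each step proposes swapping one in-set index for one out-of-set index
    and accepts with probability det(new) / (det(new) + det(old)). Since
    k-DPP probabilities are proportional to principal minors det(K_{C,C}),
    this acceptance rule satisfies detailed balance for the k-DPP.
    """
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    C = list(rng.choice(N, size=k, replace=False))   # uniform init (assumption)
    det_C = np.linalg.det(K[np.ix_(C, C)])
    for _ in range(n_iters):
        i = int(rng.integers(k))     # position in C to swap out
        j = int(rng.integers(N))     # candidate index to swap in
        if j in C:                   # lazy step: keep the current state
            continue
        C_prop = C.copy()
        C_prop[i] = j
        det_prop = np.linalg.det(K[np.ix_(C_prop, C_prop)])
        if rng.random() * (det_C + det_prop) < det_prop:
            C, det_C = C_prop, det_prop
    return sorted(C)
```

Note that each step only needs the ratio of two k x k determinants, which is what makes the chain cheap relative to exact k-DPP sampling; the question above about alpha concerns how many such steps are needed to mix.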
It would be good to include a couple of sentences explaining the AdapFull method, since it is the one most closely related to the k-DPP method. In particular, since DPP probabilities are equivalent to volumes, can you give a quick explanation of how its "volume sampling" differs from k-DPP sampling?
Perhaps use a darker yellow for the RegLev lines in all the plots; they're a little hard to see.
In the supplementary material, Section A.2, Figure 8, Subfigure (a): is there any intuition as to why uniform sampling outperforms all other methods when there are >= 40 landmarks?
In the supplementary material, Section A.4, Figures 12 and 13: it would be nice to see these results separated out onto more plots; some of the markers obscure others.

=====
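As a reference for the quantities compared throughout these reviews, the Nystrom reconstruction and the relative Frobenius error metric can be sketched as follows. This is a minimal illustration on synthetic data; the RBF kernel, bandwidth, and the trivial landmark choice are placeholders, not the paper's setup (a k-DPP would pick a diverse landmark set instead).

```python
import numpy as np

def nystrom(K, C):
    """Nystrom reconstruction of a PSD matrix K from landmark set C:
    K_hat = K[:, C] @ pinv(K[C, C]) @ K[C, :]."""
    B = K[:, C]
    W = K[np.ix_(C, C)]
    return B @ np.linalg.pinv(W) @ B.T

# Relative Frobenius error, the metric plotted in the experiments
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                      # RBF kernel (placeholder bandwidth)
C = list(range(10))                  # placeholder landmarks
rel_err = np.linalg.norm(K - nystrom(K, C), 'fro') / np.linalg.norm(K, 'fro')
```

Since the residual K - K_hat is a Schur complement and hence PSD with eigenvalues dominated by those of K, the relative error always lies in [0, 1); diverse (e.g., DPP-sampled) landmarks drive it down faster than uniform sampling for a given c.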