Paper ID: 128
Title: The Variational Nyström method for large-scale spectral problems

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The authors address the problem of efficiently estimating the eigenvectors of large affinity matrices. Such spectral problems arise frequently in kernel methods, spectral clustering, manifold learning, etc. They provide the following contributions:

- A synthesis of existing techniques, yielding a framework for interpreting different approximations as extending an embedding based on a subset of landmark points via transformation by an "out-of-sample" matrix.
- Formulation of a new "Variational Nyström method", motivated within their framework as directly optimizing the landmark embedding for a chosen out-of-sample matrix. This takes the form of a generalized eigenproblem.
- An investigation of the normalization issues that arise when solving spectral problems involving graph Laplacians with subsampling-based methods, including a theoretical result indicating that their variational method requires fewer normalization approximations than pure Nyström or column sampling, as well as an empirical study of different plausible normalization approaches.
- Empirical results demonstrating that Variational Nyström achieves greater accuracy for the same runtime on some datasets when low- to medium-accuracy results are acceptable.

Clarity - Justification:

The authors do a good job summarizing prior work, providing a clear and succinct presentation. Their introduction of the Variational Nyström method subsequently feels logical and well-motivated, and their empirical work is easy to follow. The paper is well-organized, and notation is clearly defined throughout.

Significance - Justification:

Faster methods for solving large-scale spectral problems are significant. The paper provides an extension of existing techniques for estimating eigenvectors of kernel matrices that appears to yield better runtime-accuracy tradeoffs for the practical setting where the matrix is very large and accuracy requirements are not extremely strict. The approach is incremental (sharing many individual components with prior work), but nonetheless elegant and effective.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The method is well-motivated, and the runtime-accuracy experiments are intriguing. However, it would be good to see a more thorough set of experiments, especially given the somewhat weak theoretical component. The empirical results could be made more compelling by including additional datasets beyond MNIST; it would be reassuring to see quantitative demonstrations that the better accuracy at equivalent runtime generalizes. While the spectral clustering example is visually striking, the comparison is only against Nyström (the previous experiments suggest a comparison with LLL), and the MNIST embedding visualization is suggestive but not particularly conclusive.

Perhaps more significantly, the scope of the work seems broader than the focus on spectral embeddings would suggest. Some investigation, or at least discussion, of using the approach to speed up kernel-based learning methods would be welcome, given that both Nyström and column sampling are effective in this area. This appears to have been partially covered by the referenced modified Nyström method, which shares eigenvectors with the presented variational approach. At a minimum, a brief summary of their findings regarding kernel approximation error would be helpful; however, the variational method could also be used to derive a matrix-projection-based low-rank approximation, via X'*X*M, which may be more accurate than the purely spectral approximation given by modified Nyström (where the eigenvalues are obtained from the reduced problem). An empirical demonstration of the runtime-accuracy tradeoffs for kernel approximation (or classification), similar to Figure 2, could provide a useful additional contribution and boost the significance of the paper.
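For concreteness, here is a rough sketch (reviewer's own notation and toy data, not the authors' code) of the kind of comparison suggested above, using plain Nyström eigenvectors as a stand-in for whichever approximate eigenvectors the method returns:

import numpy as np

# Toy PSD affinity matrix M standing in for the real kernel/affinity matrix.
rng = np.random.default_rng(0)
N, L, d = 2000, 100, 10
A = rng.standard_normal((N, 40))
M = A @ A.T

# Random landmarks: W = landmark-landmark block, C = all-points-to-landmarks block.
idx = rng.choice(N, size=L, replace=False)
W = M[np.ix_(idx, idx)]
C = M[:, idx]

# Reduced eigenproblem on the landmarks, then Nystrom extrapolation to all N points.
w, U = np.linalg.eigh(W)
top = np.argsort(w)[::-1][:d]
w, U = w[top], U[:, top]
V = (C @ U) / w                      # N x d approximate eigenvectors

# (1) Purely spectral reconstruction: eigenvalues kept from the reduced problem.
M_spec = V @ np.diag(w) @ V.T

# (2) Matrix-projection reconstruction: project M onto the span of the
#     approximate eigenvectors (the projection-based approximation above).
M_proj = V @ (np.linalg.pinv(V) @ M)

for name, Mhat in (("spectral", M_spec), ("projection", M_proj)):
    err = np.linalg.norm(M - Mhat, "fro") / np.linalg.norm(M, "fro")
    print(f"{name:>10s} relative Frobenius error: {err:.3f}")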
Minor fixes:

- Lines 107-108: "since it is optimizes the choice of the out-of-sample matrix instead of fixing it" -- this seems backwards, since the out-of-sample matrix is fixed while the landmark embeddings are optimized.
- Line 111: "how do use" -> "how to use"
- Line 239: "not is it obvious" -> "nor is it obvious"
- Line 446: "there have been proposed two ways" -> "two ways have been proposed"
- Line 533: "We going to" -> "We are going to"
- Line 796: "an set up" -> "a set up"

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

An improved Nyström method, obtained by constraining the original problem to satisfy the Nyström formula. It gives a lower approximation error and requires fewer landmarks.

Clarity - Justification:

The authors give a good presentation, with an interesting Table 1 providing several comparisons.

Significance - Justification:

Though the contribution is incremental with respect to the Nyström literature, it is nevertheless very interesting, with the insightful overview in Table 1 and the good improvements in approximation and in the number of landmarks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The paper is insightful at the level of out-of-sample formulas and comparisons. It also gives nice improved results. One aspect that could receive more attention is a comparison with the work on kernel spectral clustering, where explicit kernel-based model representations are available for making out-of-sample extensions (either exact or Nyström-based). These methods have also been applied to large graphs. It is unclear whether those kernel-based learning methods obey the same out-of-sample formula as the one proposed in the paper, or a different one.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The Variational Nyström Method for Large-Scale Spectral Problems

The authors propose the so-called Variational Nyström method, which is (mainly) a combination of concepts from the classical Nyström technique and LLL (an embedding approach). While interesting, the approach makes internal use of the full matrix M, although this is reduced in the final model step to small matrices. Accordingly, large computation costs are expected during the calculation, and it may become impossible to apply the approach to large datasets where M gets large as well. The authors argue that this can be done in an iterative manner, but at least N^2 operations are still necessary. This may of course be parallelized, but it can still be costly for very large N, and it may be infeasible if the calculation of all similarities is expensive (e.g., if the similarity function itself is costly). The linear runtime-complexity assumption is a bit shaky. A probably novel part is the analysis of the Nyström approximation for approximating graph Laplacians (Sec. 4).
The related experiments, especially, are interesting, but it is hard to draw theoretically generalizing conclusions from the results.

Clarity - Justification:

The paper presentation is quite clear.

Significance - Justification:

As detailed in the full review, the paper has some interesting points. But it also has some points to criticize. In particular, the evaluation should be modified and the experiments need further extension.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Major comments:

- Nice overview of recent approaches in the Nyström context.

- 'Its fundamental disadvantage is that the reduced eigenproblem on the landmarks, to which the Nyström formula applies, uses only the landmark-landmark affinity values. If too few landmarks are used, this eigenproblem gives a bad approximation and so does the Nyström extrapolation.' -- Well, this is true for the classical setting, but in Gisbrecht et al. an approach was shown to obtain an exact eigenvalue decomposition for low-rank matrices using the Nyström technique with linear costs; see Section 4.1 of: Gisbrecht, A. and Schleif, F.-M., "Metric and non-metric proximity transformations at linear costs", Neurocomputing 167:643-657, 2015, http://dx.doi.org/10.1016/j.neucom.2015.04.017. Gisbrecht also has a toolbox on his webpage that could be used for comparisons.

- Besides the number of landmarks, a substantial problem of the Nyström approach is to find a good set of landmark points. I think this should be mentioned as well, and some brief references to relevant work in this line could be provided.

- I think it would be good to include the approach of Gisbrecht for the eigendecomposition (which has the additional benefit that it also works for indefinite matrices).

- You write 'for Variational Nyström becomes C^T M C (of L × L) and ...'; this means you have to use the full matrix M in your computations. I wonder if this is not very unattractive from a practical point of view, or may become impossible if M gets large for large N. Actually, a very positive point of the Nyström approximation is that it avoids larger calculations and the full matrix M is not needed in any step. (A rough sketch of this point follows below.)
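To make my concern concrete, below is a rough sketch (my own, not code from the paper) of how the reduced L × L matrices could be accumulated over row blocks of M so that M is never stored in full; even then, every affinity must be evaluated once, so the work remains on the order of N^2 unless M is sparse or cheap to compute. The Gaussian affinity and all names are my own illustration.

import numpy as np

def reduced_matrices(X, C, affinity, block_size=512):
    """Accumulate C^T M C and C^T C over row blocks of M without storing M.
    X: (N, D) data; C: (N, L) out-of-sample matrix; affinity: callable that
    returns the (B, N) block of affinities between a batch and all points."""
    N, L = C.shape
    CtMC = np.zeros((L, L))
    for start in range(0, N, block_size):
        stop = min(start + block_size, N)
        M_block = affinity(X[start:stop], X)        # B x N rows of M
        CtMC += C[start:stop].T @ (M_block @ C)     # accumulate C_b^T M_b C
    return CtMC, C.T @ C

def gaussian_affinity(Xa, Xb, sigma=1.0):
    # Squared Euclidean distances via the usual expansion, then a Gaussian kernel.
    d2 = (Xa**2).sum(1)[:, None] + (Xb**2).sum(1)[None, :] - 2.0 * Xa @ Xb.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
N, D, Lm = 3000, 5, 50
X = rng.standard_normal((N, D))
C = rng.standard_normal((N, Lm))    # stand-in for the paper's out-of-sample matrix
CtMC, CtC = reduced_matrices(X, C, gaussian_affinity)
print(CtMC.shape, CtC.shape)        # (50, 50) (50, 50)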
- 'This makes the method applicable when the affinity between two points is a sophisticated function of their context, as in image segmentation using intervening contours cues' --> Hm, this may be true, but it would be good to address whether the affinities are somehow expected to be metric (leading to M being psd); e.g., are you expecting symmetry, and so on? So, basically, specify your mathematical requirements on the affinity matrix M (and on the underlying proximity function).

- ', but we find that random landmarks give better error-runtime tradeoff unless one uses very few landmarks.' --> This is indeed a point: typically you would like to use few landmarks, both to limit the cost of the out-of-sample extension and to reduce the computational load in the approximation (model) step. Why do you think that choosing a large number of landmarks (and how many?) is meaningful for your approach?

- Runtime: 'and M is sparse, the cost is linear in number of points N' -- Hm, you are sweeping some things under the carpet. Having M very sparse is not so likely, e.g. if the data distributions are not very well separated. Further, it is hard to judge how sparse M should be to get linear runtime complexity in real settings, if the matrix M is included in your model generation (in full, although iteratively processed).

- For evaluating embeddings (including Laplacian eigenmaps) I suggest using a framework as proposed by M. Verleysen, see e.g. (or related work): Lee, J. A. and Verleysen, M., "Quality assessment of dimensionality reduction: Rank-based criteria", Neurocomputing 72(7-9):1431-1443, 2009, http://dx.doi.org/10.1016/j.neucom.2008.12.017. (A rough sketch of the kind of criterion I mean is appended after this review.)

- For sparsification you could have a look at MEKA (jmlr.org/proceedings/papers/v32/si14.pdf).

- I hope your conclusions on p. 6 are not only due to the single swissroll dataset ...

- Experiments are based on only two datasets (MNIST and swissprot); this is, I have to say, a bit too little to draw more general conclusions. Why not provide some more datasets, with summarized error results in a table?

Minor comments:

- 'We show that by constraining the original problem to satisfy the Nyström formula,' --> this sentence is rather unclear.
- 'how do use subsampling approximations with data-dependent' --> 'how to use ...'?
- 'It was originally proposed' -- here you may actually add a reference to the original work of Nyström.
- Are you sure you want to say 'W.l.o.g.,' before Eq. 1, or is it in fact not without loss of generality? :-)
- 'not is it obvious how it can be used ...' --> 'nor is it ...'
- 'do not use the whole affinity matrix, available from the' --> '(do not) use the whole ...'?
- 'not used in neither' --> 'neither used in ... nor in ...'
- Regarding Sec. 4: K. Zhang provided (aligned with a published paper) some code for embedding approaches, including a Nyström-approximated Laplacian eigenmaps (so using graph Laplacians): https://www.cse.ust.hk/~twinsen/manifold.htm. This could be instructive for the description in Sec. 4.

=====
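Addendum to the comment on embedding evaluation in Review #3: a rough sketch (the reviewer's own code, not code from the cited paper) of a simple rank-based neighbourhood-preservation score in the spirit of Lee & Verleysen (2009), namely the average overlap between the K-nearest-neighbour sets of each point in the original space and in the embedding. The function name and the toy data are purely illustrative.

import numpy as np

def knn_overlap_quality(X_high, X_low, K=10):
    """Fraction of K nearest neighbours preserved by the embedding (in [0, 1])."""
    def knn_sets(X):
        d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d, np.inf)                 # exclude self-neighbours
        return np.argsort(d, axis=1)[:, :K]
    nn_high, nn_low = knn_sets(X_high), knn_sets(X_low)
    overlaps = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(overlaps)) / K

# Toy usage: score a random 2-D linear projection of 20-D Gaussian data.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
Y = X @ rng.standard_normal((20, 2))                # stand-in for an embedding
print(knn_overlap_quality(X, Y, K=10))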