We thank all the reviewers. We are glad that the reviewers found that the "variational Nyström method subsequently feels logical and well-motivated" and that it has an "insightful overview Table 1 and the good improvements in approximation and number of landmarks".

Assigned_Reviewer_1: Thanks for the thoughtful and important comments.

Use of full M in VN. For the same number of landmarks, VN (like LLL) gives a better approximation than Nyström, because it uses the non-landmark elements of M, whereas Nyström simply ignores them. We agree that this ends up being more costly, but the overall trade-off favors VN and LLL, as we show in the experimental section. If the matrix is too dense and too large to use, we can further sparsify it, e.g. by setting its smallest elements to zero. We believe this still gives better results than Nyström, which ignores the non-landmark rows of the affinity matrix entirely.

Computational cost. In practice, for many applications, M is (very) sparse. The computational cost is O(C·N·L²), where C depends on the number of non-zeros in M. While the exact formula depends on the particular structure of M, for sparse M with C much less than N² the total cost is linear.

Thanks for pointing us to the Gisbrecht et al. paper. Their analysis and the formulas in Sec. 4.1 match our data in Table 1 nicely. Their analysis of converting dissimilarities to affinities can be used as a preprocessing step before VN, precisely because VN can take (dis)similarities as input (unlike LLL, as we note in ll. 390-395). We will reflect this in the paper.

Regarding non-psd M. That is an interesting point. While we do not require M to be psd, the reduced affinities C'MC should be psd and symmetric to constitute a valid spectral problem (and not even that if one uses the non-metric pseudo-Euclidean space defined in Gisbrecht et al.). As we discuss in ll. 635-641, we sometimes see that VN makes the problem more robust.
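As a minimal sketch of the check described above (not the paper's code; the matrix sizes and the reduction matrix C here are illustrative placeholders), one can symmetrize the reduced affinities C'MC and test positive semi-definiteness via its eigenvalues:

```python
import numpy as np

# Illustrative sketch: verify that the reduced affinity matrix C'MC is
# symmetric and psd before treating it as a valid spectral problem.
rng = np.random.default_rng(0)
N, L = 200, 10                      # N points, L landmarks (placeholder sizes)
M = rng.standard_normal((N, N))
M = (M + M.T) / 2                   # a symmetric (not necessarily psd) affinity matrix
C = rng.standard_normal((N, L))     # hypothetical N x L reduction matrix

R = C.T @ M @ C                     # reduced L x L affinity matrix
R = (R + R.T) / 2                   # enforce symmetry against round-off
eigvals = np.linalg.eigvalsh(R)     # eigenvalues of the symmetric matrix
is_psd = eigvals.min() >= -1e-10 * max(1.0, abs(eigvals).max())
```

If `is_psd` fails, the reduced problem is not a valid (metric) spectral problem and would need the kind of non-metric treatment mentioned above.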
It would be an interesting generalization to see how VN works in non-psd settings.

Number of landmarks L and their location. Increasing L improves the result of the algorithm, so our approach is quite simple: select L as large as you computationally can. We have found that selecting landmarks at random works well and robustly (note the tight error bars in all our experiments). We think that a more careful choice of landmarks (using e.g. leverage scores) could improve the results, but it would also increase the runtime. Due to lack of space, we omitted this discussion. The rank-based criterion of Lee and Verleysen, '09 is interesting, but it is only one of many; ultimately, it depends on the end application the method is used for. As far as this paper is concerned, we felt that visual results were the best demonstration.

Experimental results. Due to lack of space it was hard to include more results in the paper, but based on our pilot runs on other datasets (spiral, text embedding) the results are similar. Including a table is a good suggestion, but we felt that Figs. 1-2 provide a more comprehensive and visual analysis of the algorithms, albeit limited to one dataset per figure. We will address all the minor corrections. Thanks!
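The simple landmark-selection strategy discussed above can be sketched as follows (a hedged illustration, not the paper's implementation; N, L, and the seed are placeholders):

```python
import numpy as np

# Uniform random landmark selection without replacement: the simple,
# robust strategy discussed above. More careful schemes (e.g. leverage
# scores) would replace this sampling step at extra runtime cost.
rng = np.random.default_rng(42)
N, L = 1000, 50                                   # N data points, L landmarks
landmarks = rng.choice(N, size=L, replace=False)  # L distinct indices in [0, N)
landmarks.sort()
```

The tight error bars reported in the experiments suggest the result is not very sensitive to which random subset is drawn.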