Dear Reviewers,$
Thank you for your careful reading of our work and your insightful remarks. The main comments of Rev_1 and Rev_5 both concern the bibliography, so we start our answer by discussing these points. We then provide a point-by-point answer to Rev_3's comments.
SHORTCOMINGS OF PREVIOUS WORKS:
* The works cited in lines 73 to 82 all propose ways to sidestep the partial diagonalisation of the Laplacian and to approximate SC's feature vectors (often with provable guarantees). Shortcoming: they still have to compute the high-dimensional k-means, which typically costs O(Nk^2) per iteration and becomes prohibitive for large N and k.
* Yan09 and Wang09 propose several sampling methods, for instance random or selective sampling. Shortcoming: there is no guarantee that their proposals are true approximations of SC.
CONTRIBUTIONS:
We propose to address both of SC's bottlenecks.
* For bottleneck #1, the originality of our method is that we do not try to estimate the exact classical feature vectors, but instead estimate equivalent features with similar \emph{inter-distances}.
* For bottleneck #2, the originality of our method comes from our guarantees (under the assumptions stated in the paper) that sampling k\log{k} nodes, performing low-dimensional k-means on them, and interpolating the results back to the whole graph does recover the classical SC results, with an approximation error that we control.
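To make the "sample + low-dimensional k-means + interpolate" pipeline of bottleneck #2 concrete, here is a minimal, self-contained toy sketch. It uses uniform sampling, plain Lloyd's k-means, and nearest-centroid assignment on 1-D features purely for illustration; the actual sampling distribution, feature construction, and interpolation step (Eq. (5)) are the ones specified in the paper, not these stand-ins.

```python
import math
import random

def lloyd_kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on 1-D points (toy stand-in for the
    low-dimensional k-means step)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[j].append(x)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

# Toy "features": two well-separated 1-D groups of nodes.
features = [0.0 + 0.01 * i for i in range(100)] + \
           [10.0 + 0.01 * i for i in range(100)]
N, k = len(features), 2

# Step 1: sample O(k log k) nodes (uniformly here, for illustration only).
m = max(k, math.ceil(k * math.log(k)))
rng = random.Random(1)
sample = rng.sample(features, m)

# Step 2: run k-means on the small sample only.
centroids = lloyd_kmeans(sample, k)

# Step 3: "interpolate" back to the whole graph: here, by assigning every
# node the label of its nearest centroid (the paper's interpolation differs).
labels = [min(range(k), key=lambda j: (x - centroids[j]) ** 2)
          for x in features]
```

The point of the sketch is only the shape of the pipeline: the expensive k-means runs on m ~ k\log{k} points instead of N.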
COMPARISONS WITH “CORE-SETS”:
Core-set methods first compute “representative” nodes, i.e. a subset of nodes for which the k-means cost of the optimal solution is well controlled by the k-means cost of the optimal solution on the whole data-set. Heuristics are then typically used to cluster these representative data points in low dimension; the solution is in general trivially interpolated by assigning the cluster label of each representative node to all the nodes it represents in high dimension. The nice property of core-sets is their guarantees on the approximation error.
First of all, note that the works we cited [Yan09 and Wang09] propose very similar ideas. In Yan09, the authors explicitly cite the core-sets literature as closely related to their work. There are therefore links to the core-sets literature in the cited papers. We do agree, though, that we should give the core-sets literature more visibility.
Second, core-sets were created to decrease the complexity of k-means algorithms; they are not specifically tailored to approximating SC or to proposing fast versions of it, which is the goal of our paper. For instance, core-sets do not solve the problem of feature computation in SC.
Third, let us see how core-sets could fit in our algorithm \emph{after} feature computation. They could fit in two ways:
1) After feature sampling, one could use coresets-kmeans instead of classical k-means (provided the k\log{k} sampled nodes are not too few for core-sets to be meaningful). In this case, coresets-kmeans is complementary to our algorithm.
2) Instead of the "feature sampling + low-dimensional k-means + interpolation" steps [which cost O(k^2\log^2{k}+pNk)], one could use coresets-kmeans [which costs O(N+k^5\log^9{N}) according to Thm 6.1 of “Coresets for k-means and k-median clustering and their applications” by Har-Peled and Mazumdar] to solve the high-dimensional k-means problem. This would yield another method. Neither method has a clear edge over the other in terms of theoretical computation cost (even though for large k our method seems less expensive), and precise comparisons should be made. One of the main differences between the two methods is that we explicitly exploit the structure of the underlying similarity graph for sampling, whereas coresets-kmeans does not. Due to lack of space, we postpone precise comparisons to later work. Nevertheless, we agree we should add direct references to the core-sets literature.
POINT-BY-POINT ANSWER TO REV_3:
* Yes, the general k-means problem is solved by heuristics that typically get stuck in local optima, but this is not our point here. There is only one step that we do not precisely control (pointed out in line 370): that the low-dimensional k-means provides the same result as the high-dimensional k-means restricted to the small subset of sampled nodes. If this assumption holds (on top of c_j being close to \span{U_k} and M satisfying the RIP), then Eq. (5) outputs the correct result. This is what we mean by “faithful estimation”.
* As explained in Sec 4.4, the complexity of solving Eq. (5) with gradient-descent algorithms is O(pkN). As discussed in Sec 5, we chose a broad-range off-the-shelf solver: gmres. A detailed discussion of optimal algorithms for Eq. (5) would take too much space.
* On top of SBMs, we also run a demo on the Amazon graph, which is irregular! We performed experiments on other irregular examples but unfortunately do not have space to present them.
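Regarding the gradient-descent cost mentioned above for Eq. (5): since we cannot reproduce Eq. (5) here, the following toy sketch only illustrates the generic structure of such a solver on a small symmetric positive-definite system (each iteration costs one matrix-vector product, the analogue of the O(pkN) per-iteration cost of Sec 4.4); in practice we use the off-the-shelf gmres solver instead.

```python
def matvec(A, x):
    # Dense matrix-vector product; the per-iteration bottleneck.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def grad_descent_solve(A, b, step=0.1, iters=500):
    """Solve Ax = b for symmetric positive-definite A by minimising
    0.5 x^T A x - b^T x; the gradient is Ax - b."""
    x = [0.0] * len(b)
    for _ in range(iters):
        g = [ai - bi for ai, bi in zip(matvec(A, x), b)]
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

A = [[3.0, 1.0], [1.0, 2.0]]   # toy SPD system, exact solution x = (1, 2)
b = [5.0, 5.0]
x = grad_descent_solve(A, b)
```

With a sparse or structured operator in place of the dense matvec, the cost per iteration drops accordingly, which is what makes the O(pkN) figure possible.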
Yours,
The authors.