Paper ID: 1164
Title: Speeding up k-means by approximating Euclidean distances via block vectors

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): A computationally intensive aspect of running k-means involves computing the distance from points to candidate centers.  The paper presents two new ideas: a method of computing a lower bound on the distance via Holder’s inequality and also block vectors.   The combination of these ideas results in a speedup of the number of distance calculations.  The speed up is demonstrated empirically by comparing to k-means and yinyang.

Clarity - Justification: Clear presentation and reproducible algorithms on public data.

Significance - Justification: k-means is a widely used algorithm that is run on large datasets.  So algorithms that can be speed up k-means are important. 

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Speeding up k-means is a worthy and important goal.  This topic is pertinent to the ML community.  Is it possible to (mathematically) analyze the number of distance comparisons saved -- possibly in specific situations, e.g., mixture of Gaussians?  Can these techniques also speed up the initialization of k-means, were it to be performed via k-means++?  There are other approximation algorithms that may also benefit from these lower bound calculations.  This paper  "Local search heuristics for k-median and facility location problems" gives an iterative algorithm wherein k arbitrary points are initially selected as centers and a center is repeatedly swapped with a non-center if it reduces the clustering objective.  I wonder if you might experience gains in run time in that context as well.  The baselines should be strengthened.  Can you include a comparison to “Scalable k-means++”, a parallel algorithm?  Overall, this paper offers new insights into speeding up k-means.  The experiments can be strengthened by comparing to more baselines, and also showing how the techniques may apply to other clustering algorithms.

=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes an acceleration of the Lloyd's algorithm for k-means clustering by efficiently upper bounding pairwise inner-products (and consequently lower bounding pairwise distances) via 'blockification' to improve the efficiency of the cluster assignment phase of the Lloyd's algorithm by eliminating unnecessary distance computations of points to cluster centers. The paper also presents a similar acceleration of the recently proposed 'Yinyang' k-means algorithm. Empirical results indicate that the proposed acceleration almost always provides speedup over the baseline algorithm, especially for really large value of k for clustering and for datasets with favourably structured sparse dimensions.

Clarity - Justification: The authors motivate the problem and the intuitive behind the main idea very clearly and the paper is very easy to follow. However, there are some specific parts I am unable to follow: * It is not clear to me how the blocking is precisely done -- one can imagine that the speedups from a smart blocking might be much more substantial than speedups from a vanilla evenly spaced blocking.  * Related to the above point, it is unclear how the blocking is precisely done so that the condition annz(X_B) ~ 0.3 annz(X) is satisfied. If we are to naively try different block sizes and compute the annz(.) function, this can be computationally very expensive. * Is the blockification time considered in the runtime for the proposed acceleration? * I do not quite follow the idea behind Figure 2. It seems to imply that memory overhead of Yinyang scales linearly with the number of clusters k while the overhead of 'blockification' is independent of k. This is misleading to me since, from what I understand, fast-yinyang will always have the memory overhead of both yinyang and blockification (since blockification is applied on top of yinyang) and hence the final algorithm will still have overhead that scales linearly with the number of clusters. * While I follow why yinyang memory overhead scales at O(nk/10), it is not clear to me from the text why this overhead also increases with sparsity in Figure 2.

Significance - Justification: The manuscript presents an interesting idea to accelerate the exact Lloyd's algorithm for k-means clustering and empirically appears to be most effective when the number of clusters k is large and data has high sparsity. The proposed idea is also orthogonal to the usual accelerations that take advantage of the triangle inequality and hence can be used in conjunction with them. In terms of significance, this work can be seen as the application of a special case of the Hoelder compression proposed by Low & Zheng, 2012 (see below) to upper bound inner products for the task of efficient k-means clustering.  One concern is that this acceleration scheme requires a memory overhead that is linear in the size of the dataset (the authors use a setting that makes the overhead 0.3 times the dataset size) which can be quite limiting for really large datasets (and limited memory). This overhead appears to be on top of whatever overhead the base algorithm has (for example, the O(nk/10) overhead of yinyang). The authors should motivate why this kind of overhead is acceptable. Otherwise, the memory overhead might make it hard to apply this acceleration to large scale k-means clustering tasks 

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The idea of blockification is at the core of this manuscript. However, these kinds of Hoelder bounds are quite loose if \langle x, y \rangle can be < 0 for some x, y \in the dataset. But, since clustering is trannslation invariant, we can translate the whole dataset to the positive quadrant and ensure that \langle x, y \rangle >= 0. However, in that case, this 'blockification' idea is very similar to the t-level Hoelder compression proposed by Low & Zheng, 2012 for t = 1. It would be good for the authors to compare and contrast the proposed blockification to Hoelder compression for the purposes of upper bounding the pairwise inner products.   Minor comment: * It would be useful to explain why figure 1 and figure 3 do not present all datasets considered in the empirical section.  Citations: Low, Yucheng, and Alice X. Zheng. "Fast top-k similarity queries via matrix compression." Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012.

=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): Given n points of dimension d, the k-means problem is to find k centers that minimize the sum of the squared distances of each point to its closest center. Lloyd's algorithm is a popular local search heuristic from the 60s.  The bottleneck in an iteration of Lloyd's algorithm for the k-means problem is to identify the closest center c(p) for each input point p. This is typically done by computing the distance of p to *all* centers and choosing the minimum. Thus, as computing the distance from one point to one center takes time O(d), the overall running time of one iteration is O(nkd).    The submission in question addresses the bottleneck under the premise that an improved algorithm should yield exactly the same clustering as the original Lloyd's algorithm (because this clustering is known to be meaningful in practice). More specifically, it strives to heuristically reduce the dependence on d.  The first part of the paper provides a lower bound lb(x,y) on the distance between two points x and y. The bound can be computed in constant time given that (essentially) the length of x and y is precomputed. It is based on Hölder's inequality that is tightened by combining the coordinates of x and y into blocks.  The bound can be used in the following way: Knowing dist(x,y), we can guarantee that dist(x,y) < dist(x,z) if lb(x,z) > dist(x,y). Using this observation allows the authors to skip costly distance computations in Lloyd's algorithm.   The paper then evaluates the use of this heuristic running time improvement in the standard algorithm by Lloyd and in an improved variant by Ding and others (called Yinyang implementation). The evaluation is done on nine popular instances from the literature and for three different values of k per instance.  It turns out that the modification of Lloyd's algorithm yields a speed up over the Yinyang implementation in 14 of 27 cases; the modification of the Yinyang implementation yields a speed-up over the original Yinyang implementation in 20 out of 27 cases.

Clarity - Justification: I found the paper to be very well written and easily accessible.

Significance - Justification: The idea to use lower bounds to save distance computations is not new (the same idea is used in the Yinyang implementation, for instance). However, the idea to use Hölder's inequality seems to be new (at least in this area). The same goes for the idea to improve the bound by grouping the coordinates of the points in blocks.   The paper is not a major breakthrough, but I feel that it's method could be adopted by the community if further experiments prove its worth. It does not open up a new area for research but rather incrementally modifies existing work. The paper shows a solid knowledge on related work on speeding up Lloyd's algorithm.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): While the theoretical part of the paper is sound, I would have liked a more verbose experimental section to convince me of the usefulness of the method. Here are some questions that could have been addressed:  - What is the influence of the choice of the initial centers on the outcome of the experiment? Are all algorithms initialized with the same set of centers? If the choice of initial centers is random, how large is the variance in the speed-up? - What is the influence of the dimension of the points on the speedup? - What about a comparison to the original Lloyd implementation and to Elkan's variant?  Other comments: - Is Proposition 3.1 new/original? Maybe state a reference.

=====