Paper ID: 64
Title: k-variates++: more pluses in the k-means++
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper proposes a density-based variation of the popular k-means++ algorithm, then adds variations for the streaming and online cases. Experiments on real and synthetic data are conducted, and several results are proven, including provable effectiveness of the proposed algorithms in the distributed, streaming, and online settings, as well as connections with differential privacy.
Clarity - Justification: The supplementary material consists of 40 pages. Proofs are omitted from the main paper, as are the details of the experiments. The authors have done a lot of work, and the desire to include it all is understandable; however, it makes the conference paper too dense and unnecessarily difficult to follow.
Significance - Justification: The paper proposes an elegant variation of k-means++, and shows many of its advantages in a variety of settings.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Overall, the paper proposes an interesting algorithm with applications to numerous domains. I found the work interesting and believe that this research can have notable impact if presented more clearly. The main drawback is that, as a conference paper, it is too dense, covering too much while omitting the details, proofs, and necessary explanations to motivate some of the topics, assumptions, and results. I'd recommend focusing on about two topics and omitting at least one to make room for a better exposition. This work will have higher impact if the authors allow space to make it easier to follow. For the experimental results, the main paper should at minimum include a chart numerically comparing the misclassification error of the new method against k-means++ on several real data sets.
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes a generalization of the k-means++ method where one has to query the points through a (potentially noisy) probe function, rather than directly. The authors give a tight analysis of the dependence of the method on the parameters (mean/variance) of the probe function.
Clarity - Justification: There are a few typos, e.g.: "smaller than expectable" --> "smaller than expected"; "lowerbound" --> "lower bound".
Significance - Justification: This is a neat and useful generalization that can be used to tackle many problems, as the authors themselves show.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The authors carefully analyze what happens when the sampling method in k-means++ is noisy -- this is equivalent to the sampling happening through some additional probe function p. This is a clean way to analyze many potential algorithms -- streaming models where the exact data points are forgotten, differential privacy approaches where the exact points should not be revealed, and so on. Overall the paper is interesting and well written. I have two concerns: - The model for the distributed setting is unorthodox and a bit strange. What is a scenario where there exists a node of high computational power, N? Why not use other, more standard models for distributed or parallel computation? - The parameters in Theorem 6 seem to be data dependent, and as such, I am not sure how to interpret the approximation ratio. As far as I can tell, there is no way to compute the approximation ratio of OLk-means++ while running the algorithm. This makes the approximation result less useful in practice.
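To make the reviewed idea concrete, here is a minimal sketch of D^2 seeding where each selected point is replaced by a noisy probe of itself. This is an illustration of the general concept only: the function name `probe_seeding`, the Gaussian choice of probe, and the `probe_std` parameter are assumptions of this sketch, not the paper's actual probe function or notation.

```python
import numpy as np

def probe_seeding(X, k, probe_std=0.1, rng=None):
    """Sketch: k-means++-style D^2 seeding, but each chosen center is a
    noisy probe (here, a Gaussian perturbation) of the sampled point.
    probe_std is an illustrative parameter, not from the paper."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # First center: probe of a uniformly random point.
    centers = [X[rng.integers(n)] + rng.normal(0.0, probe_std, d)]
    for _ in range(k - 1):
        # D^2 weights computed against the (noisy) probe centers.
        C = np.array(centers)                                  # (m, d)
        d2 = np.min(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        i = rng.choice(n, p=d2 / d2.sum())
        # Release only a probe of the chosen point, never the point itself.
        centers.append(X[i] + rng.normal(0.0, probe_std, d))
    return np.array(centers)
```

Setting `probe_std=0` recovers ordinary k-means++ seeding; a positive value models the noisy/private settings the review describes, at the cost of the distortion the authors bound in their analysis.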
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The k-means++ initialization is shown to be equivalent to sampling from a set of Dirac densities. The paper generalizes this to sampling from a local/noisy distribution. This enables the authors to propose initialization methods for different settings such as the distributed setting, streaming, and differential privacy preservation. Overall, the paper provides a generalized framework and a few incremental contributions to initializing centers in a wide variety of settings. This makes the paper extremely interesting.
Clarity - Justification: The paper is extremely well written. The contributions span a wide variety of problems, and the content has been split appropriately to ensure readability of the main paper. However, 40 pages of supplementary material is quite long. A little more explanation of the probe function (in Section 2) would be helpful. Also in the same section, it might be useful to point out that I_d is the d x d identity matrix and not d times the identity matrix.
Significance - Justification: Two main points outline the significance:
1. The reduction of many existing methods to special instances of the proposed k-variates framework. The results offer comparable bounds using different analysis than known in the literature.
2. The proposed framework is also used to propose novel algorithms in slightly more restricted settings in some cases.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): k-variates++, the algorithm: it essentially modifies the k-means++ algorithm in two ways.
1. It considers distances between probe points and chosen centers instead of distances between the raw points and the chosen centers. By bounding the distortion (or stretching) introduced by the probe points, the authors extend the algorithm to the streaming setting (Alg. 2).
2. Instead of sampling a point directly, a point is sampled from a local distribution. This is used to handle privacy concerns in the private distributed setting (Alg. 1).
The proposed online k-means++ variant involves running batch-style k-means++ on a mini-batch S_j instead of the entire data. Thus the algorithm itself is not novel, but the theoretical guarantee provided in Theorem 6 is interesting. Overall, the paper strongly validates its claims both theoretically and experimentally.
=====