Paper ID: 631
Title: $K$-Means Clustering with Distributed Dimensions

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an algorithm for k-means clustering when the computation is distributed between agents, each of which has access only to a projection of the input data matrix onto a subset of the coordinates. The algorithmic contribution of the paper is a proposal for how to use known k-means approximation algorithms as subroutines (both to carry out the clustering at each agent, and to combine the outcomes of the different agents into a single clustering of the input data set), and a claim (Theorem 1) that the resulting combined algorithm achieves an approximation ratio that is roughly only 5 times worse than the approximation ratio of the subroutine algorithm, while keeping the communication costs between the agents and the central server low.

Clarity - Justification:
The expository ("soft tissue") part of the paper is very well written. However, some key technical steps are presented via a formula that I could not decipher (and no verbal explanation is provided).

Significance - Justification:
I have two concerns about the paper. The first is the motivation. In their motivating comments the authors describe scenarios in which the input instances have different types of attributes and each agent measures, or has access to, only attributes belonging to one such type. However, in such scenarios I cannot imagine that k-means clustering over the full attribute vectors makes any practical sense. (A common motivation for carrying out distributed computation is to save memory capacity or computation costs. However, in the scenario analyzed in this paper there is no significant benefit in that respect, since the authors assume that the number of agents, T, is very small ("constant"), and in any case some of the most common k-means algorithms start by projecting the data to a low dimension.) My second, and perhaps more severe, concern about the significance of the paper is that I am skeptical about the results (as I explain in the detailed comments below).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1) I fail to parse the formula at the end of part 5 of Algorithm 1. This is a crucial formula, since it defines how the central server determines the final clusters. Consequently, it is not clear to me why the resulting collection of k subsets covers all of the input data set.
2) In particular, the intersection over all l \in [T] of the sets M^l_{i_l} is not clear. It seems to rely on a matching of each agent index l with a cluster index i_l. I could not find a definition of that matching (see my attempted reading in the sketch after these comments).
3) It is rather suspicious that the approximation quality of the resulting clustering, as stated in Theorem 1, does not depend on the number of processors, T, that the algorithm is distributed between.
4) Algorithm 2 does not seem to be an algorithm. Rather, it seems to be a definition of some collection of weighted points.
5) (This is just a proposal for improvement for the authors.) In the description of the method (lines 253-254) the authors propose that each party send to the central server not only its local k centers but also the membership of each point. This is needless, since the central server can compute the memberships on its own from the centers and the coordinates assigned to each party.
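Returning to points 1 and 2: for what it is worth, here is a minimal sketch (hypothetical Python, all names are mine, not the authors') of my best guess at what the formula intends, namely that the subset associated with a tuple of local cluster indices (i_1, ..., i_T) is the intersection of the corresponding memberships M^l_{i_l}. If this reading is correct, every point falls into exactly one such subset, which would answer my coverage question; a verbal statement to this effect in the paper would help.

    # Hypothetical reconstruction of step 5 of Algorithm 1 (my reading only):
    # group the points by the tuple of local cluster indices that the T parties
    # assign to them; the group for (i_1, ..., i_T) is exactly the intersection
    # over l of the sets M^l_{i_l}.
    from collections import defaultdict

    def grid_cells(memberships):
        # memberships[l][p] = index of the local cluster to which party l assigns point p
        T = len(memberships)
        n = len(memberships[0])
        cells = defaultdict(list)
        for p in range(n):
            key = tuple(memberships[l][p] for l in range(T))
            cells[key].append(p)
        return cells  # every point appears in exactly one cell

What remains unclear to me even under this reading is how these (up to k^T) cells are reduced to k final clusters, i.e., the matching of agents to cluster indices mentioned in point 2.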
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an approach to doing KMeans clustering in a distributed fashion, where the dimensions of the dataset are distributed across multiple machines. The authors provide approximation algorithms in this setting that give constant approximation ratios. They also perform experiments that demonstrate the effectiveness of their distributed KMeans algorithm.

Clarity - Justification:
The paper is clearly written -- the mathematical formulation is well-motivated and well-explained.

Significance - Justification:
Distributed clustering is increasingly becoming an important tool for clustering -- this paper proposes a novel approach to clustering by partitioning the dimensions of the data, and shows the overall effectiveness of the approach through experiments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Strengths:
+ The paper derives approximation algorithms for KMeans clustering with distributed dimensions, giving a constant approximation ratio with low computational overhead.
+ It shows via experiments how the distributed algorithm is efficient as a clustering algorithm.
Weaknesses:
- It would have been good if the authors had shown the robustness of the distributed algorithm to noise in the data.
Overall the paper is quite novel, and is a strong submission.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper discusses the k-means problem in a distributed setting. The input, consisting of n points of dimension d, is distributed among T parties. Every party has a distinct subset of the *dimensions* of the points, i.e., every party knows a subset of the features (and it knows the point ids, so that the parties can exchange information in a useful way). There is also a central server. The goal is that the central server knows a good (approximate) solution for the k-means problem on the distributed instance. The other parties do not need to know the solution. The communication cost of the approximation algorithm should be small.

The paper proposes the first constant-factor approximations dedicated specifically to this problem setting. It also overviews the conclusions that can be drawn from existing work on the related problem of dimensionality reduction for k-means. In comparison, the communication complexity of the new algorithms has the advantage that it does not include a term of the form f(k)*n, as all related results do. The communication complexity of the presented (9+eps)-approximation is T*n + k*d (numbers). The paper also includes a lower bound of T*n on the communication complexity of the problem. The algorithm is also improved to an approximation ratio of (5+eps) at the cost of a slightly larger communication complexity. Furthermore, the paper demonstrates the adaptability of the new method by extending the result to two variations of the k-means problem: constrained k-means and k-means with outliers. The paper concludes with experiments that compare the new methods to two algorithms derived from known work on dimensionality reduction.

Clarity - Justification:
The paper is clearly structured and well written. The only part that is a bit difficult to parse is the pseudocode. Very positively, I could follow and verify the proofs of the approximation guarantees of the two algorithms relatively easily and found no gaps or unclear steps here.
In this important section, I also found almost no typos. The section on other results gives a good idea of why and how the algorithms can be adapted. The shorter sections 5-8 seem to contain more typos, but are also clearly written.

Significance - Justification:
The k-means problem is an old and well-studied problem, and algorithms for it are widely used for clustering. The variant studied in this paper is new, but sufficiently close to research directions that have already been accepted as interesting for the k-means problem, like dimensionality reduction or size reduction (coresets). I think that the study of this version enriches the study of clustering problems and is interesting. The new methods are elegant, provide an approximation guarantee, are implementable/practical, and are adaptable to more than only the basic k-means formulation (a feature not commonly shared by approximation algorithms for the k-means problem). The paper also contains a convincing discussion of related work and how the new method fits in.

The studied distributed setting can be solved by using dimensionality reduction techniques: every party can interpret its data as a k-means instance, reduce the dimensionality of this instance while approximately preserving the k-means cost, and then send the smaller instance to the server. This enables the server to compute an approximate solution, which can be projected back into the original space with a little more communication between server and parties. Known dimensionality reduction techniques include random-projection-based techniques, singular value decomposition (SVD) based techniques, and SVD variants that do feature selection (i.e., project to axis-aligned subspaces). The paper derives the communication cost that this approach induces for different known dimensionality reduction techniques. I checked two of the stated bounds (the calculations are omitted in the paper) and they seem correct.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
My recommendation is "accept" with a tendency to "strong accept". Since "weak" and "strong" accept are the only options, I choose "strong accept". I like the studied problem, the simple yet provably approximate algorithms, their adaptability, and the fact that they have been implemented and tested. It is also nice that the algorithms are complemented with a lower bound. The only drawback I see is that the experiments section is not as extensive as I would have wished. However, what is there does provide a solid basis.

Related work
------------
The section on related work on dimensionality reduction is very convincing to me. The description in Figure 2 is very helpful (could it maybe be moved to the main text?). Section 8 should be placed in the introduction with the other related work. Also, the algorithms by Arthur/Vassilvitskii and Chawla/Gionis that are used in the experimental section should be described in the related work section. At least state that the A/V algorithm is an expected O(log k)-approximation.

(9+eps)- and (5+eps)-approximation
----------------------------------
These sections are very clear and I did not find any errors in the proofs. My only remark is that it would be very helpful to have a picture illustrating Algorithm 1. For example, one could add a small two-dimensional example where each of two parties has one of the axes. Then one could print the corresponding centers of the two parties on the axes, show all grid points, and visualize which points go to which grid point.
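For concreteness, a tiny script along the following lines could generate the data for such an example (purely illustrative; I am assuming numpy and scikit-learn here, and the data is made up):

    # Illustrative only: two parties, each holding one coordinate of 2-D points.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    points = rng.normal(size=(20, 2))   # made-up 2-D instance
    k = 2

    # Each party clusters its own coordinate independently into k local clusters.
    labels_x = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points[:, [0]])
    labels_y = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points[:, [1]])

    # A grid point is a pair (i, j) of local cluster indices; each input point
    # goes to the grid point given by its two local labels.
    for i in range(k):
        for j in range(k):
            members = np.where((labels_x == i) & (labels_y == j))[0]
            print(f"grid point ({i}, {j}): points {members.tolist()}")

Plotting the points, the two sets of local centers on the axes, and the resulting assignment of points to the (at most k^2) grid points is exactly the kind of figure I have in mind.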
This figure could replace Figure 3 -- the application of the triangle inequality should be possible to follow without an illustration.

Other results
-------------
The section on k-means clustering with outliers was convincing to me and I appreciated the supplementary material on this result. The paragraph on k-means with chromatic conflict could use a bit more detail for my taste. One thing that particularly confused me is the following: in the previous results, all points that belong to the same grid point are later clustered together. However, here it is stated that the points belonging to a grid point shall be considered as a set of individual colored points located at the same location. How does this affect the proof? I don't think that it poses a problem, but a bit more detail would be nice.

Experiments
-----------
This section is relatively short. Some remarks on information that could be added:
- What is the number of repetitions for each test case?
- Maybe you can discuss why you chose to use the A/V algorithm despite the fact that there are algorithms with much better guarantees. Potentially, algorithms that are heuristic or have large approximation factors can by themselves induce different clustering costs (maybe the new algorithms just work better with A/V than the other approaches do?). This also holds for the C/G algorithm, which is a heuristic. A bit of discussion on whether there is a reason to believe that A/V and C/G are stable enough to make the conclusions drawn for your algorithm reliable would be nice.
- Related question: is the variance of A/V low on your data? (It is a randomized algorithm.)
- It would also be good to discuss the large data set a bit more: I expect that it is very sparse since it is a bag-of-words data set. What effect does this have on the proposed algorithms and on the random projection approach?

Remarks on the writing
- In the description in lines 137-140, it is stated that "T and k are much smaller in typical applications". I'm not sure that this is always true for k.
- Figure 4 is unfortunately placed; can it be moved to the next page?
- Printed on a standard b/w printer, the diagrams in Figure 5 are hard to read.

Typos
- l. 39+40: "in distributed setting" -> "in the distributed setting"
- l. 405: "the optimal solution" -> "an optimal solution"
- l. 448/449: "by triangle inequality" -> "by the relaxed triangle inequality"
- l. 536: "the value .. can only be non-increase" -> "the value .. cannot increase"
- l. 687: "An DISJ_n,T can be.." -> "An DISJ_n,T instance can be.."
- l. 692/693: "by max_X": X is not defined or used
- l. 727/728: "our algorithms of DistDim-k-means and.." -> "our algorithms DistDim-k-means and.."
- l. 730: "(1) random projection based approach" -> "the random projection based approach"
- l. 757: "considering distributed setting" -> "considering the distributed setting"
- l. 796: "as showed" -> "as shown"
- l. 821: "apply coreset" -> "apply coresets"
- l. 873: "to achieve the similar approximation ratio" -> "to achieve a similar approximation ratio"

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper aims at providing a framework for supporting k-means clustering in distributed settings, which is first proposed as a general solution (so as to lower communication costs), and then extended to two specific settings: k-means clustering with outliers, and constrained k-means clustering. The results are appreciable as theoretical analysis, but poor as experimental assessment and findings.
Clarity - Justification:
The results themselves are poor. For this kind of algorithm, you need much more experimental analysis and testing. The experiments you considered are poor in impact, and the datasets are limited in size and dimensionality. Overall, the reproducibility of the proposed techniques appears difficult.

Significance - Justification:
Distributed dimension management mixed with clustering is not a big novelty in machine learning and, especially, data mining. Look at the case of distributed OLAP processing. You can easily understand that your dimension-based distribution method is really "baseline", with poor novelty. All the rest is simply a straightforward analysis of what k-means could be in distributed settings.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper aims at providing a framework for supporting k-means clustering in distributed settings, which is first proposed as a general solution (so as to lower communication costs), and then extended to two specific settings: k-means clustering with outliers, and constrained k-means clustering.

Distributed dimension management mixed with clustering is not a big novelty in machine learning and, especially, data mining. Look at the case of distributed OLAP processing. You can easily understand that your dimension-based distribution method is really "baseline", with poor novelty. All the rest is simply a straightforward analysis of what k-means could be in distributed settings.

Some assumptions are too strong. E.g., what happens with very irregular dimensions, i.e., dimensions whose cardinalities differ greatly from each other (e.g., 10 and 1,000,000)? Your distribution model clearly does not work well in this case. Your model barely considers the possibility of network failures, which is still relevant to take into account. The proposed approximations are, indeed, simple extensions of state-of-the-art theoretical analysis for data communication systems.

The results themselves are poor. For this kind of algorithm, you need much more experimental analysis and testing. The experiments you considered are poor in impact, and the datasets are limited in size and dimensionality. Overall, the reproducibility of the proposed techniques appears difficult.

=====