Paper ID: 1221
Title: Clustering High Dimensional Categorical Data via Topographical Features

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a method for clustering high-dimensional data by characterising the dataset in terms of attractive basins around modes, overlaid with a tree-based density model. This yields high-level topographical features, and algorithms are presented for efficient inference.

Clarity - Justification:
The English expression starts well in the first column of the paper, but then strays in the second and beyond, which is distracting. At times the presentation labours in fancy ML-speak (modes, topography, etc.) when a more generic computer-science presentation would have suited.

Significance - Justification:
Overall, I think it's a good paper, and it should be accepted. Its presentation is reasonable, but not great, and I think that's my only real concern.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I really like the fact that this is based on an "old" result, Chow-Liu (1968). This particular procedure isn't familiar to me, but the idea of tree embeddings is central to much algorithmic thought, and it's great to see it applied here. The authors also combine local search and dynamic programming (via the tree structure); these are standard techniques, but the sophistication lies in the way they are combined.

The data sets don't seem enormous, and experiments that run for only ten minutes aren't that compelling, but there is certainly a range. I found some of the data sets imperfectly described, especially the synthetic ones ("randomly flipped" is a bit vague), and what is the ACG data set? Thankfully, these authors did include running times, and it seems there are very few data sets on which another algorithm is superior in both run time and effectiveness to the one in the paper.

Figure 2 is a bit odd, as it is itself a 2-D embedding. And what about testing against single-linkage methods? As for spectral clustering, since it's fast, I'd still like to have seen the results. Given that the authors could have spent another column, some analysis beyond NMI alone would have been helpful, though Figure 2 is certainly appreciated.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present a novel clustering algorithm for data with discrete features. It essentially involves learning a Chow-Liu tree over the features and then finding modes / attractive basins in the probability distribution induced by the tree. Empirically, the authors show on UCI datasets that their method performs favorably compared to existing approaches.

Clarity - Justification:
Overall I think the paper is well written and readable, but I have a few suggestions for improvement:
(1) The use of the term "label" to describe a feature value is a bit confusing, since in clustering it usually means "cluster label".
(2) A toy example would be nice.
(3) I think first describing the meta-clustering algorithm and then using it to motivate the choice of the tree / Chow-Liu algorithm would be better. In particular, the Chow-Liu tree is needed to deal with the exponential number of choices for next(x).
(4) Better discussion of the parallel properties of the algorithm. Am I correct in understanding that each point can be processed independently of the others? That would be a considerable advantage of the approach.
Significance - Justification:
I think the method is quite innovative. Is it true that once the Chow-Liu tree has been learned, each point can be processed in parallel with the others? That would give the algorithm considerable computational benefits over mixture models, k-means, etc.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Each datapoint x has the form (x_1, ..., x_D), where each x_i takes on one of a discrete set of values. The authors propose an interesting approach to clustering. In particular, for each point x in the dataset, they consider the (exponentially large) set of all "neighbors": N(x) = {y : d(x, y) < \delta}, where d is the Hamming distance. Note that while x is in the data, N(x) may contain points not in the dataset and can therefore be exponential in size. The authors' goal is to iteratively apply

    y <- next(x) = argmax_{y \in N(x)} p(y)

under a certain probability distribution p, until convergence (i.e. x = next(x)). This would normally be intractable due to the exponential size of N(x), but the authors show a tractable formulation by assuming p(\cdot) follows a tree distribution. (A rough illustrative sketch of this mode-seeking loop is appended after the reviews.) Experiments on UCI datasets show that the algorithm achieves better performance than existing clustering approaches and has reasonable runtime.

Suggestions for improvement: In addition to my comments in the sections above, I think the paper could be improved in the following ways:
(1) More discussion of runtime in the context of the complexity of other approaches like k-means.
(2) Why is the algorithm compared to mixtures of Gaussians if the data is discrete? I don't understand how this is done. Furthermore, mixtures of multinomials is the standard analog of mixtures of Gaussians in the discrete setting. The authors do compare with mixtures of multinomials, but I think they should also mention this in the related-work discussion in the Introduction.
(3) Running on some larger datasets to demonstrate scalability.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a method for clustering based on trees. First, a tree-structured graphical model is learned on the data using an existing algorithm. Then, starting at each point in the data set, a "mode" is found from that point in some way. Data that converge to the same mode are clustered together.

Clarity - Justification:
The paper was extremely difficult for me to follow. It is not clear what exactly is being done. Much of the discussion reviewing k-means, GMMs, graphical models, etc. should be taken as known by the ICML audience, and a clearer presentation of the novel aspects of the algorithm given instead. Currently the paper reads like a long recipe of actions to perform, which is hard to keep track of. If there is an objective function being optimized, it is lost in the discussion.

Significance - Justification:
Experiments are performed on several data sets, but it amounts to reporting numbers (normalized mutual information). It's hard to know what it all means, and why this algorithm sometimes performs better.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See above.

=====
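Appendix (editor's sketch): to make the mode-seeking step described in Review #2 concrete, below is a minimal Python sketch of greedy hill climbing under a tree-factored (Chow-Liu style) distribution. It is a simplification, not the paper's method: moves are restricted to Hamming-distance-1 neighbours (the paper's dynamic program on the tree is what handles the full delta-neighbourhood), and all names (log_prob_tree, find_mode, cluster_by_modes, node_marg, edge_cond, parents) are assumed for illustration.

    import numpy as np

    def log_prob_tree(x, root, parents, node_marg, edge_cond):
        # Log-probability of configuration x under a tree-factored distribution:
        #   p(x) = p(x_root) * prod_{i != root} p(x_i | x_parent(i))
        # node_marg[v] = p(x_root = v);
        # edge_cond[i][u, v] = p(x_i = v | x_parent(i) = u);
        # parents maps each non-root node i to its parent.
        lp = np.log(node_marg[x[root]])
        for i, par in parents.items():
            lp += np.log(edge_cond[i][x[par], x[i]])
        return lp

    def find_mode(x, n_values, root, parents, node_marg, edge_cond, max_iter=100):
        # Greedy hill climbing: repeatedly take the best single-coordinate
        # change (Hamming distance 1) until no change increases p.
        x = list(x)
        best = log_prob_tree(x, root, parents, node_marg, edge_cond)
        for _ in range(max_iter):
            improved = False
            for i in range(len(x)):
                for v in range(n_values[i]):
                    if v == x[i]:
                        continue
                    y = x[:i] + [v] + x[i + 1:]
                    lp = log_prob_tree(y, root, parents, node_marg, edge_cond)
                    if lp > best:
                        best, x, improved = lp, y, True
            if not improved:
                break
        return tuple(x)

    def cluster_by_modes(data, n_values, root, parents, node_marg, edge_cond):
        # Points whose hill climbs converge to the same mode share a cluster.
        modes, labels = {}, []
        for x in data:
            m = find_mode(x, n_values, root, parents, node_marg, edge_cond)
            labels.append(modes.setdefault(m, len(modes)))
        return labels

Note that each call to find_mode depends only on the learned tree parameters, which is consistent with Review #2's observation that, once the Chow-Liu tree is fixed, every data point can be processed independently and in parallel.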