Paper ID: 969
Title: Interactive Bayesian Hierarchical Clustering

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers the problem of hierarchical clustering where, in addition to distance information (the points live in R^d), the algorithm also has user-supplied constraints of the form ({a,b},c), indicating that the hierarchy should have a cluster containing {a,b} but not c. There are classic algorithms for efficiently finding a hierarchy consistent with a given set of such triplet constraints (Aho et al., 1981), but the problems considered here are (a) how to combine those constraints with the metric-space information, and (b) how to handle an interactive process in which a user is presented with a (partial) clustering and then provides a violated triplet if the clustering is not correct. (Any refinement of the target clustering T* is viewed as correct.) Issue (a) is addressed by positing a Bayesian model, the Dirichlet diffusion tree of Neal (2003), and showing that an existing MCMC algorithm for sampling from the posterior can be augmented to include the given constraints while maintaining aperiodicity and irreducibility. Issue (b) is addressed by interleaving random subtree querying (giving the user the subtree for a random subset of leaves) with a method that queries a high-variance subtree.
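To make the constraint form concrete, here is a minimal Python sketch (illustrative only, not the authors' code or data structures) of the check that a constraint ({a,b},c) imposes on a candidate hierarchy: some cluster must contain both a and b while excluding c. Representing a tree by the leaf-sets induced by its internal nodes, and all names below, are assumptions made for illustration.

    # Minimal sketch, assuming a hierarchy is represented by the leaf-sets
    # induced by its internal nodes (an illustrative choice, not the paper's).
    def satisfies_triplet(clusters, a, b, c):
        """True if some cluster contains both a and b but excludes c."""
        return any(a in s and b in s and c not in s for s in clusters)

    # Example: the hierarchy (((a, b), c), d) induces these clusters.
    clusters = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "d"}]
    assert satisfies_triplet(clusters, "a", "b", "c")      # ({a,b},c) holds
    assert not satisfies_triplet(clusters, "a", "c", "b")  # ({a,c},b) is violated

In a constrained sampler of the kind the summary describes, one would presumably reject any MCMC proposal that fails this check for some user-supplied triplet.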
Clarity - Justification:
The paper is generally well-written and easy to follow.

Significance - Justification:
Positives: The setting seems an interesting and natural one, the proposed method looks quite reasonable, the paper is well written, and the experiments (on small datasets with 150 points) look pretty good.
Negatives: There is not that much theoretical analysis. For instance, (1) in addition to the chain being irreducible, can one say anything about it being rapidly mixing? (2) Can one identify conditions on the data under which the method can be guaranteed to work well?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As mentioned above, on the positive side the setting seems interesting and natural, and the paper is well written and clear. On the negative side, there is not much theoretical analysis. True, there is a proof that the chain is aperiodic and irreducible, but that is often not a big challenge -- I was hoping to learn more about when one should expect this approach to work well. Also, the experiments are quite small. Question: does the algorithm scale, e.g., if instead of 150 points you used 10^5 points? Overall, it was a pleasant and interesting read, though I was hoping for a bit more analysis.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an interactive, semi-supervised method for hierarchical clustering based on the Dirichlet diffusion tree. The algorithm uses human-in-the-loop MH steps to guide inference, leading to trees that match the user's constraints.

Clarity - Justification:
The paper was well-written and easy to navigate. I had a couple of uncertainties about how inference was carried out, which I discuss under the detailed comments.

Significance - Justification:
I have not seen this form of user-interactive clustering in a hierarchical context, but it seems a good combination of ideas, given both the interpretability of hierarchies and the difficulties of navigating the space of trees.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I think this is a nice paper; however, there are a couple of things I would like the authors to clarify. I would also like to see a somewhat more thorough evaluation, as described below. I am a little confused about the mechanism for obtaining triplets. As I understand it from Section 3.2, the user identifies triplets in subtrees generated independently of the current state of the sampler. Would it not be more reasonable to identify violated triplets in the current tree? Secondly, it seems that aperiodicity only holds if the triplets are all obtained prior to inference. If, as suggested in Section 3.3, we add to the set of constraints during inference, we have no guarantee that the current tree does not violate the constraints. If we retain aperiodicity, there is no direct way to correct errors in the current tree. In inferring the variance of the tree, it is a little unclear whether the variance is calculated between time-points in the Markov chain or between a set of trees sampled from the MH proposal distribution (one possible reading is sketched after this review). Of the four experiments, only two (20 Newsgroups and Zoo) are clearly hierarchical clustering problems. I feel it would be worthwhile comparing with constrained flat clustering methods, as described in Section 1.1. Further, while I realize that user experiments add difficulty and cost, I think there would be a clear benefit here to adding MTurk or similar qualitative evaluations to assess tree "quality". The goal of the paper is not necessarily to provide the tree with the highest likelihood, but rather to find the desired partition out of multiple possible partitions, which might depend on user preference.
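Regarding the variance question above: one plausible, purely illustrative reading of "variance between a set of trees" is the frequency with which each cluster recurs across a collection of sampled trees, with rarely recurring clusters marking unstable subtrees that would be natural targets for an active query. The sketch below assumes that reading and the same leaf-set representation as before; it is not the authors' method, and all names are hypothetical.

    # Hedged sketch, not the authors' method: score disagreement across a set
    # of sampled trees by how often each cluster (leaf-set under an internal
    # node) recurs.  Clusters that appear in few samples mark unstable regions.
    from collections import Counter

    def cluster_frequencies(sampled_trees):
        """sampled_trees: list of trees, each given as a list of leaf-sets."""
        counts = Counter()
        for clusters in sampled_trees:
            counts.update(frozenset(s) for s in clusters)
        n = len(sampled_trees)
        return {s: counts[s] / n for s in counts}

    def least_stable_clusters(sampled_trees, k=3):
        """Return the k clusters that recur in the fewest sampled trees."""
        freqs = cluster_frequencies(sampled_trees)
        return sorted(freqs, key=freqs.get)[:k]

    # Example with three sampled trees over the leaves {a, b, c, d}:
    samples = [
        [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "d"}],
        [{"a", "b"}, {"a", "b", "d"}, {"a", "b", "c", "d"}],
        [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "d"}],
    ]
    print(least_stable_clusters(samples, k=2))  # the {a,b,d} cluster (1/3), then {a,b,c} (2/3)

Whether the sampled trees come from successive states of the Markov chain or from a batch of MH proposals is exactly the ambiguity the reviewer raises.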
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents an interactive Bayesian approach to hierarchical clustering. The setting applies to situations where a hierarchical clustering cannot be found on the basis of constraints alone and the geometry of the data must also be accounted for. The user provides triplets specifying elements that must be clustered together and an element that must be excluded. The algorithm finds a tree consistent with these constraints by sampling from a constrained distribution over candidate trees. The approach includes random subtree queries and active subtree queries, as well as interleaving the two.

Clarity - Justification:
The paper is well written and relatively easy to follow.

Significance - Justification:
The approach seems to work, but the evaluation could do more to illustrate the contribution with respect to related work.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The evaluation and experiments, likely due to space constraints, are a little hasty and unclear. First, why are these the right metrics to evaluate the approach? Then, when claims are made about which approaches were better or worse than others, it is unclear which graph in Fig. 6 is being referred to. Some of these statements are not universally true across all of the graphs, but are perhaps true in only one of them. Your initial claim was that the geometry of the data really matters, so it would be nice to see that point made more concretely in the results; how does it come into play? The notion of which types of queries are reasonable to ask a person is interesting. While outside the scope of this paper, it is nice to see the comments about "simple" queries being unrealistic for a person. Related to that, I assume it will matter very much whether the person is consistent in their labels of "cuteness" or whatever the clustering task may be. It is interesting to think about how the interleaving process itself might influence people's consistency in this regard. High-variance versus random subtrees might be easier or harder to judge for whether a violation exists, and I would assume that the ability to get "correct" labels might end up being the most important element.

=====