Paper ID: 1114
Title: Conditional Bernoulli Mixtures for Multi-label Classification

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a probabilistic mixture-of-experts model for multi-label classification. Training of the model can be done efficiently using expectation maximization, and both MAP and marginal prediction are considered, where a combinatorial search method is proposed for MAP inference. The paper discusses a large number of possible alternatives for this task; in comprehensive experiments the proposed method is shown to perform well.

Clarity - Justification:
See below.

Significance - Justification:
See below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Review summary. A very well-written paper; a simple but flexible and effective method; strong experimental validation. A really useful contribution.

Review. This paper is very well written and I enjoyed reading it. Related work is discussed in detail and makes the advantages of the proposed method clear:
+ Flexible: both the conditional mixture distribution and the component distributions can be chosen for convenience;
+ Simple training and prediction: training is simple EM, and prediction of both marginals and the MAP assignment is feasible, allowing optimization of various loss functions for multi-label classification;
+ Good scaling behaviour: the method seems efficient for large-scale applications, both in the number of instances and in the number of labels.

In terms of future work, I wonder whether the authors considered regularized MLE for estimating their model, for example by adding an entropy term for the mixture distribution to ensure smoothing, such that not just a single expert model is responsible for the prediction; a possible formulation is sketched at the end of this review. (A full Bayesian mixture approach seems difficult due to switching between components.)

I have no serious criticism about the work and believe it will make a nice contribution to the ICML program.

Minor details.
Line 562, Algorithm 2: how is this a dynamic programming method? To me it seems more like a search method with an upper bound stopping criterion.
Line 770, Table 2: could you please add runtimes and standard errors for each cell? (There seems to be sufficient white space in the table to allow for this.)
Line 882, "Springer".
Line 962, "McAuliffe".
Line 966, "EM".
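To make the regularization suggestion above concrete, one possible penalized objective (my own notation, not from the paper: \pi_k(x) denotes the mixture/gating weights, K the number of components, \Theta all model parameters, and \lambda >= 0 a smoothing strength) would be

  \max_{\Theta} \sum_{i=1}^{N} \log p(\mathbf{y}_i \mid \mathbf{x}_i; \Theta) + \lambda \sum_{i=1}^{N} H\big(\pi(\cdot \mid \mathbf{x}_i)\big),
  \quad \text{where} \quad
  H\big(\pi(\cdot \mid \mathbf{x})\big) = -\sum_{k=1}^{K} \pi_k(\mathbf{x}) \log \pi_k(\mathbf{x}).

The entropy bonus discourages the gating distribution from assigning all responsibility to a single expert; presumably only the update for the gating model would change, picking up this extra term.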
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present a new approximate method for estimating the joint mode over all labels in multi-label classification. The computational burden of estimating the joint mode is avoided by introducing a probabilistic model that combines a set of Bernoulli distributions. An EM-inspired algorithm for fitting the parameters is presented.

Clarity - Justification:
In general the paper is well written. I could follow most of it. A few things are unclear (see detailed comments).

Significance - Justification:
The method that is introduced by the authors is novel and interesting. Estimating the joint mode in multi-label classification is not an easy job, and this method has the potential to make a contribution in that direction. However, it is still unclear what the method is in fact estimating (see detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I like the general idea behind the method that is introduced in this paper. However, I have two important comments that make me hesitate to accept this paper.

First of all, it is not clear from the paper whether the method indeed intends to estimate the joint mode. I have the feeling that the method rather intends to estimate the marginal modes (so, optimizing the Hamming loss instead of the subset accuracy). Eqn. 7 represents a mixture of distributions for which the local joint probability is factorized into marginals. As discussed by the authors, each component of this model can be interpreted as a local model, since the feature space is divided into several regions; for each region one component of the mixture applies. By assuming conditional independence for such a local region, one is not really estimating the joint mode anymore. From a nearest neighbor perspective, the approach of the authors rather minimizes the Hamming loss, for which it suffices to fit marginals locally, in a given neighborhood. If one were to implement a nearest neighbor method for estimating the joint mode, it would be more reasonable to model the joint conditional probability P(vector y | vector x). So, I believe that the method of the authors is making a rather crude approximation here, and one should be able to construct synthetic datasets for which this method performs poorly in comparison with other methods (see the first sketch at the end of this review).

My second comment concerns the comparison to the state-of-the-art. Both in the discussion in Section 5 and in the experimental analysis in Section 6, I believe that the comparison to the state-of-the-art is not very fair. Especially the discussion of PCC, which is very closely related to the method of the authors, is incorrect. In recent years several fast inference algorithms for PCC have been proposed; see for example:
Kumar et al. Learning and inference in probabilistic classifier chains with beam search, ECML/PKDD 2012.
Kumar et al. Beam search algorithms for multi-label learning, Machine Learning 2013.
Dembczynski et al. An analysis of chaining in multi-label classification, ECAI 2012.
The authors are apparently unaware of those papers, but especially the last one adopts a sampling idea very similar to the current paper's for finding the joint mode efficiently. So, the "--" signs in Table 2, which indicate that PCC did not terminate within 80 hours, do not reflect the state-of-the-art. The experiments for PCC should be redone with the correct inference algorithms (see the beam-search sketch at the end of this review).

If the paper were to be accepted, I think that the above points should be addressed. The method of the authors might be interesting, but it also has a number of disadvantages, such as the approximations needed in the model formulation of Eqn. 7 and the iterative nature of the EM algorithm, which might get stuck in local optima. The competing methods of course have similar disadvantages, but at the moment the discussion is unfair.

Minor comments:
- l232: "conditioning on features greatly reduces the need to estimate label correlations". This is not entirely correct. For univariate loss functions such as Hamming loss or macro-AUC, it is still much easier and better to model label correlations (dependencies) instead of going into more complicated models that estimate joint probabilities over labels.
- l353: it is unclear to me what the upper bound means. KL divergence quantifies a difference between two distributions, and it is unclear what those two distributions are. In particular, what is the difference between the parameters alpha and theta?
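To make the distinction between the joint mode and the vector of marginal modes concrete, here is a minimal sketch (toy parameters of my own choosing, not the model fitted in the paper) of a two-component mixture of label-wise independent Bernoullis over two labels:

import itertools
import numpy as np

pi = np.array([0.5, 0.5])            # mixture weights pi_k(x) at some fixed x
mu = np.array([[0.90, 0.15],         # component 1: per-label probabilities mu_{k,l}
               [0.15, 0.90]])        # component 2

def joint_prob(y):
    """p(y | x) = sum_k pi_k * prod_l mu_kl^y_l * (1 - mu_kl)^(1 - y_l)."""
    y = np.asarray(y)
    per_component = np.prod(mu ** y * (1 - mu) ** (1 - y), axis=1)
    return float(pi @ per_component)

# Joint mode by brute-force enumeration (only feasible for tiny L).
joint_mode = max(itertools.product([0, 1], repeat=2), key=joint_prob)

# Marginal modes: threshold each label's marginal p(y_l = 1 | x) at 0.5.
marginals = pi @ mu                                # [0.525, 0.525]
marginal_modes = tuple(int(m > 0.5) for m in marginals)

print(joint_mode, joint_prob(joint_mode))          # (0, 1) with probability 0.39, tied with (1, 0)
print(marginal_modes, joint_prob(marginal_modes))  # (1, 1) with probability only 0.135

Thresholding the marginals gives (1, 1), whose joint probability is only 0.135, while the joint mode has probability 0.39. Which of the two the proposed MAP procedure actually targets is exactly what should be clarified.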
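Regarding the second comment, the following is a rough sketch of beam-search inference for a probabilistic classifier chain (a toy illustration under a hypothetical chained conditional model, not the exact algorithms of the cited papers). The point is that inference costs on the order of L * beam_width conditional evaluations instead of enumerating all 2^L label vectors, so the 80-hour timeouts reported for PCC are not representative:

from typing import Callable, List, Sequence, Tuple

def pcc_beam_search(
    cond_prob: Callable[[int, Sequence[int]], float],  # hypothetical p(y_l = 1 | x, y_1, ..., y_{l-1})
    num_labels: int,
    beam_width: int = 5,
) -> Tuple[Tuple[int, ...], float]:
    """Return the best label vector found and its chained probability."""
    # Each beam entry is (partial label tuple, product of chained conditionals so far).
    beam: List[Tuple[Tuple[int, ...], float]] = [((), 1.0)]
    for l in range(num_labels):
        candidates = []
        for prefix, score in beam:
            p1 = cond_prob(l, prefix)  # probability that label l is relevant given the prefix
            candidates.append((prefix + (1,), score * p1))
            candidates.append((prefix + (0,), score * (1.0 - p1)))
        # Keep only the beam_width highest-scoring partial assignments.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam[0]

# Made-up toy chain: label 0 is relevant with probability 0.9, and each later label
# tends to agree with its predecessor.
def toy_cond_prob(l, prefix):
    return 0.9 if (l == 0 or prefix[-1] == 1) else 0.2

print(pcc_beam_search(toy_cond_prob, num_labels=4, beam_width=3))  # (1, 1, 1, 1) with score 0.9**4

With beam_width = 2^L the search is exhaustive and exact; smaller widths trade a little accuracy for a large speedup, which is essentially what the cited beam-search papers exploit.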
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors introduce a new multi-label classification algorithm that is based on a mixture of conditional Bernoulli distributions. The method is clearly introduced and instantiated with two different base learners, logistic regression and gradient boosted trees. The algorithm is verified in experimental studies in which it achieves competitive results.

Clarity - Justification:
The paper is clearly presented.

Significance - Justification:
It is somewhat surprising that a similar method has not been previously introduced (at least I am not aware of any reference). The use of Bernoulli mixtures seems to be quite natural for multi-label problems; however, the cost of this approach can be substantial. The only question is whether the contribution is large enough for ICML.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The complexity of the method can be substantial (K \times L), so the method probably cannot be easily applied to problems with a larger number of labels. The authors should report training and test times to enable a wider comparison of the methods.

PCC can be efficiently run by using search techniques ("Beam search algorithms for multilabel learning", "An Analysis of Chaining in Multi-Label Classification", "Using A* for Inference in Probabilistic Classifier Chains").

The correspondence to the Power-Set approach is not so clear, since CBM still needs to estimate all binary labels in each component of the mixture.

=====