We sincerely thank all three reviewers for their constructive technical suggestions, and we are pleased that all three feel our proposed method is novel and interesting. We have revised the manuscript in response to the reviewers' comments, and we hope the reviewers will consider the revisions, which address the issues raised.

MAJOR CONCERN #1 (Reviewers 1 and 5): THE CLASSIFIER-CHAIN BASELINES (CC, PCC, ECC) USED IN THE EXPERIMENTS WERE NOT THE LATEST VERSIONS.

Answer: This is correct. After submitting the paper, we realized that the standard MEKA implementations of the CC-related methods were obsolete, which is why they had low accuracy or could not finish. To address this, we reimplemented these algorithms efficiently ourselves, following the papers suggested by the reviewers. We have updated the related-work section and all experimental results, and filled in all missing entries in the result tables. The CC-related methods now achieve better performance and run much faster; in particular, PCC outperforms CRF on 2 of the 5 datasets. We have checked these new results against published results and are confident that they reflect the state of the art. Our proposed CBM method still performs competitively, achieving the highest subset accuracy on 4 of the 5 datasets; no conclusion regarding CBM changes. The updated subset-accuracy results on the five datasets are listed below (the results for CRF, CBM, BinRel, and PowSet are unchanged; we repeat them for reference).

           SCEN  RCV1  TMC2  MEDI  NUSW
CC-LR      62.9  48.2  26.2  10.9  26.0
PCC-LR     64.8  48.3  26.8  10.9  26.3
ECCL-LR    60.6  46.5  26.0  11.3  26.0
ECCS-LR    63.1  49.2  25.9  11.5  26.0
CRF        68.8  46.4  28.1  10.3  26.4
CBM-LR     69.7  49.9  28.7  13.5  27.3
BinRel-LR  51.5  40.4  25.3   9.6  24.7
PowSet-LR  68.1  50.2  28.2   9.0  26.6

CC-LR = classifier chain with greedy prediction (LR = logistic regression base learner)
PCC-LR = probabilistic classifier chain with beam-search prediction, beam width 15, as suggested by the reviewers (a sketch of this prediction procedure follows this list)
ECCL-LR = ensemble of classifier chains, voting at the individual-label level (named ECC in the submission)
ECCS-LR = ensemble of classifier chains, voting at the label-subset level (newly added baseline)
CRF = conditional random fields
CBM-LR = our conditional Bernoulli mixtures
BinRel-LR = binary relevance
PowSet-LR = label power set
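For concreteness, below is a minimal sketch of the beam-search prediction we implemented for PCC-LR. It is illustrative only: the function and variable names are ours, and models[l] stands for the l-th chained logistic regression, assumed to expose scikit-learn's predict_proba on the input features concatenated with the labels predicted so far.

import numpy as np

def beam_search_predict(x, models, beam_width=15):
    # Each beam entry is (log joint probability, partial label vector).
    beams = [(0.0, [])]
    for model in models:  # one chained model per label, in chain order
        candidates = []
        for logp, labels in beams:
            features = np.concatenate([x, labels]).reshape(1, -1)
            p1 = model.predict_proba(features)[0, 1]  # p(y_l = 1 | x, y_1..l-1)
            # Extend each kept partial assignment with y_l = 0 and y_l = 1.
            candidates.append((logp + np.log1p(-p1), labels + [0]))
            candidates.append((logp + np.log(p1), labels + [1]))
        # Keep only the beam_width most probable partial assignments.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]  # the (approximate) joint-mode label vector

Greedy CC prediction is the special case beam_width = 1, which is consistent with PCC-LR matching or beating CC-LR on every dataset in the table above.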
MAJOR CONCERN #2 (Reviewer 5): DOES CBM ESTIMATE THE JOINT OVER ALL LABELS OR THE PER-LABEL MARGINALS?

Answer: CBM estimates the joint over all labels, a critical advantage over the binary relevance method. This can be seen from the mathematical and empirical arguments below (some already in the submitted paper). It is understandable to think that, because CBM factorizes each component into independent predictors, the mixture as a whole estimates marginals and effectively optimizes Hamming loss; however, this is incorrect, and we apologize for not making it clearer in the paper. That CBM estimates the joint is supported by (1) the mixture form, (2) the empirical case study, (3) the training objective, and (4) the behavior as the number of components grows:

* Mixture form: Eq. 6 shows that even though each CBM component is fully factorized, the overall covariance matrix learned is non-diagonal, which implies the probability estimated by the mixture is not a product of marginals. (This is analogous to isotropic Gaussian mixtures, which estimate a joint even though each local Gaussian component is fully factorized.) A small numerical sketch of this effect is given after this list.

* Empirical case study: on the example image (Figures 1 and 2), the joint probability estimated by CBM gives a Pearson correlation coefficient of 0.5 between "reflection" and "lake", significantly higher than 0, and CBM correctly predicts the joint mode rather than the incorrect marginal mode.

* Training objective: CBM maximizes the likelihood of the observed label subsets (Line 344), or equivalently minimizes the KL divergence between the true joint and the estimated joint, so it would incur a large penalty if it ignored the joint and estimated only marginals. The same objective is used by the CRF and PowSet methods, and it optimizes subset accuracy. It is known that the joint mode is optimal for subset accuracy and the marginal mode is optimal for Hamming loss; since binary relevance estimates marginals, a joint-estimating CBM should beat binary relevance under subset accuracy but lose under Hamming loss, exactly as validated in the result tables (in both the paper and the supplementary material).

* Number of components: Fig. 3 shows the quality of the joint estimate for different numbers of components K. At the leftmost point, K = 1, CBM estimates marginals due to its limited capacity; as K grows, CBM moves away from a marginal estimator, becomes a better joint estimator, and achieves better subset accuracy.
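To make the mixture-form argument concrete, here is a minimal numerical sketch with made-up parameters (not taken from the paper): a two-component mixture over two labels, each component fully factorized, analogous to "reflection" and "lake" tending to co-occur.

import itertools
import numpy as np

pi = np.array([0.5, 0.5])       # mixture weights (illustrative values)
mu = np.array([[0.9, 0.9],      # component 1: both labels likely present
               [0.1, 0.1]])     # component 2: both labels likely absent

def joint(y):
    # p(y) = sum_k pi_k * prod_l mu_kl^y_l * (1 - mu_kl)^(1 - y_l)
    return float(np.sum(pi * np.prod(mu**y * (1 - mu)**(1 - y), axis=1)))

probs = {y: joint(np.array(y)) for y in itertools.product([0, 1], repeat=2)}
m1 = probs[(1, 0)] + probs[(1, 1)]   # marginal p(y1 = 1) = 0.5
m2 = probs[(0, 1)] + probs[(1, 1)]   # marginal p(y2 = 1) = 0.5
print(probs[(1, 1)])                 # joint p(1, 1) = 0.41
print(probs[(1, 1)] - m1 * m2)       # covariance 0.16: non-diagonal

Even though each component factorizes, the mixture assigns p(1, 1) = 0.41, far from the product of marginals 0.5 * 0.5 = 0.25; the two labels are positively correlated under the estimated joint, so the mixture is not a marginal estimator.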