We thank the reviewers for their positive and helpful comments.

Reviewer_1:

We thank the reviewer for bringing a very relevant paper (Quevedo et al.) to our attention. It presents an algorithm for computing the optimal labeling given L label-wise posterior probabilities under the conditional label independence assumption, albeit without an optimality characterization or a consistency analysis. Their DP runs in time O(L^3), which matches our worst-case complexity (it should be possible to cast our algorithm as their DP). We will add a discussion placing this work in the context of ours in the final version.

"How are the losses computed for multi-label datasets?" - Three of the datasets in our experiments (Reuters, Letters, Scene) are multi-class (not multi-label) datasets. For all methods, we perform one-vs-all classification and report the loss averaged over classes (as mentioned in the text).

"...I suppose that this is rather a typo than a real result." - This is indeed a typo; thank you for catching the error. The correct F1 losses on test data are: Scene: 0.3084 (averaged over 6 classes), Spambase: 0.1552, Image: 0.1458. The other reported values have been verified to be correct. We will fix these in the final version.

"Running times for training and prediction should also be given." - Noted.

"Jaccard loss, tuning a threshold ... is theoretically not optimal" - This refers to threshold tuning on the training data ('Tuning \delta' in the Tables). Note that this is optimal only in the EUM sense: on a given set of test instances we can do better (at least when the conditional probabilities are known accurately) by optimizing the loss directly on the test data via the proposed method, which finds a data-dependent \delta^* (or rather k^*). We will clarify this statement in the final version.

Reviewer_2:

"paper ... extends the results by Lewis (1995)" - We agree. We would also like to point out that our results significantly generalize the class of losses with a known optimal classifier beyond the F-measure, e.g. to popular losses such as AUC (see our response to Reviewer 3) and the Jaccard measure (see [R1] for diverse applications of these measures). In addition, our results generalize Coz Velasco et al. (2009) for multiclass classification.

"more explanations ... performs poorly on Jaccard" - We were also surprised by this observation. However, we note that our theoretical claims are asymptotic, i.e. consistency in the large-sample limit. It is therefore possible that other algorithms outperform ours on smaller samples due to finite-sample effects.

Reviewer_3:

"it would be interesting to include a discussion on the comparison between DTA and EUM for the multivariate losses" - The reviewer points out a promising research question. Our preliminary analysis suggests that asymptotic equivalence (extending the results of Ye et al. 2012) should hold for the linear-fractional family of losses, but it is not clear for more general losses. We will add a note in the final version.

"... does the AUC loss (Joachims 2005) also satisfy the monotonicity assumption?" - Yes, AUC indeed satisfies monotonicity; thank you for pointing this out. The AUC loss on binary predictions (Joachims 2005) reduces to

FP \cdot FN / ((TP + FN)(FP + TN)) = (s - TP)(p - TP) / (p(1 - p)),

where s is the fraction of predicted 1's and p is the fraction of actual 1's. Interestingly, while prior work on AUC has focused on optimizing continuous scores, our approach is able to optimize explicit label predictions.
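As a quick numerical sanity check of this reduction (a minimal sketch in Python; the random-data setup and variable names are ours, purely for illustration), one can compare the fraction of strictly swapped (positive, negative) pairs against the closed-form expression above:

import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)   # actual labels
y_pred = rng.integers(0, 2, size=200)   # hard 0/1 label predictions

# Confusion-matrix entries as fractions of the sample, matching s and p above.
TP = np.mean((y_true == 1) & (y_pred == 1))
FP = np.mean((y_true == 0) & (y_pred == 1))
FN = np.mean((y_true == 1) & (y_pred == 0))
TN = np.mean((y_true == 0) & (y_pred == 0))
s, p = y_pred.mean(), y_true.mean()

# Pairwise AUC loss for hard predictions: fraction of (positive, negative) pairs
# in which the negative is ranked strictly above the positive (ties ignored,
# consistent with the swapped-pair reduction stated above).
pos = y_pred[y_true == 1]
neg = y_pred[y_true == 0]
pairwise_loss = np.mean(pos[:, None] < neg[None, :])

print(pairwise_loss)                          # swapped-pair fraction
print(FP * FN / ((TP + FN) * (FP + TN)))      # confusion-matrix form
print((s - TP) * (p - TP) / (p * (1 - p)))    # form in terms of s and p

All three quantities agree, as expected.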
Note that optimality results for continuous scores need not trivially extend to optimality results for discrete label decisions. To the best of our knowledge, our result for label decisions is novel and is an important contribution to the literature.

References:
[R1] Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437.