Paper ID: 725
Title: Optimal Classification with Multivariate Losses

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
N.A.

Clarity - Justification:
N.A.

Significance - Justification:
N.A.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper studies the minimization of expected multivariate losses. In particular, assuming label/class independence, the paper shows that the optimal predictions for TP-monotone losses can be computed from the conditional probabilities of the positive class alone. The paper can be considered an extension of Lewis (1995), where the F-measure is studied under the same independence assumption.

It is a good paper and is very clearly written. It definitely extends the results of Lewis (1995), though perhaps not by a huge leap, considering the importance of the F-measure compared to the other losses listed in Table 1. I would like to hear more explanation of why the proposed algorithm performs poorly on the Jaccard loss. At first sight, I thought this was due to the independence assumption, but then I noticed the very good performance on the F-measure.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper studies optimal solutions for multivariate losses. The main result is a generalization of the characterization of such solutions from the F-measure to other multivariate losses that are monotonic in the true-positive count. In particular, if the labels are conditionally independent, then the optimal prediction is obtained by sorting the labels by their marginal probabilities and applying a threshold. An efficient algorithm is proposed for estimating the optimal prediction in practice and is shown to be consistent.
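For concreteness, here is a minimal sketch of this sort-and-threshold rule, assuming independent Bernoulli labels with estimated marginals; the brute-force Monte Carlo risk estimate and all names are my own illustration, not the paper's (more efficient) algorithm:

```python
import numpy as np

def expected_loss(y_pred, p, loss, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E[loss(y_pred, Y)] when the labels
    Y_i ~ Bernoulli(p_i) are independent (the paper's assumption)."""
    rng = np.random.default_rng(seed)
    Y = rng.random((n_samples, len(p))) < p      # sampled label vectors
    return np.mean([loss(y_pred, y) for y in Y])

def f1_loss(y_pred, y_true):
    """1 - F1, with zero loss when both label sets are empty."""
    tp = np.sum(y_pred & y_true)
    denom = np.sum(y_pred) + np.sum(y_true)
    return 0.0 if denom == 0 else 1.0 - 2.0 * tp / denom

def optimal_prediction(p, loss):
    """By the characterization, for a TP-monotone loss the optimal
    prediction sets the k labels with the largest marginals for some k,
    so only len(p) + 1 candidate predictions need to be scored."""
    order = np.argsort(-p)                       # sort marginals descending
    best, best_risk = None, np.inf
    for k in range(len(p) + 1):
        y_pred = np.zeros(len(p), dtype=bool)
        y_pred[order[:k]] = True                 # predict top-k positive
        risk = expected_loss(y_pred, p, loss)
        if risk < best_risk:
            best, best_risk = y_pred, risk
    return best, best_risk

marginals = np.array([0.9, 0.6, 0.4, 0.1])       # hypothetical estimates
print(optimal_prediction(marginals, f1_loss))
```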
Clarity - Justification:
The paper is well written.

Significance - Justification:
I think that the results are interesting and generalize previous results in a meaningful way. As for novelty, similar questions were already studied in the similar setting of minimizing the loss with respect to the expected confusion matrix (rather than minimizing the expected loss), so the results here seem like a natural extension.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Given that similar results exist in the EUM setting, it would be interesting to include a discussion comparing DTA and EUM for the multivariate losses handled here. The results in Ye et al. (2012) seem to apply only to the F-measure.

Line 126: "this is satisfied by most, if not all, loss functions": does the AUC loss (Joachims, 2005) also satisfy the monotonicity assumption?

Typos:
Line 679: "compute compute"

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper concerns the problem of minimizing multivariate loss functions over a joint distribution of a set of test examples. This is a different setting from the more common approach, in which population-based TP, FP, FN, and TN counts are used to define a performance measure. The paper generalizes existing results to a wide family of loss functions. An experimental study is given to illustrate the theoretical findings.

Clarity - Justification:
This is a nicely written paper that is easy to follow.

Significance - Justification:
The authors discuss in depth the problem of multivariate loss minimization over a joint distribution of a fixed set of test examples. I admit that the main contribution is, to some extent, a straightforward extension of previous results. However, it is worth publishing, as the paper nicely introduces the problem and exhaustively gathers and presents the theoretical results.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Dynamic programming for optimizing performance measures has also been introduced by Quevedo et al. (2012) ("Multilabel classifiers with a probabilistic thresholding strategy", Pattern Recognition, 45:876–883, 2012). I suppose that this method is quite similar to the approach introduced by the authors. A short discussion and a citation should be added to the paper.

The experimental part needs some clarification:
- How are the losses computed for the multi-label datasets?
- Some of the results obtained by the new method are incredibly good. For example, the F-measure loss for the scene dataset is 0.0374, while for the best competing method it is more than ten times larger, i.e., 0.39. I suppose this is a typo rather than a real result.
- Running times for training and prediction should also be given.

Could you comment on the sentence "However, in case of the Jaccard loss, tuning a threshold on training data seems to perform better though it is theoretically not optimal."? Why is it not optimal?
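For concreteness, the following is the kind of threshold-tuning baseline I take that sentence to refer to; the sketch, its names, and the evaluation protocol are my own assumptions, not taken from the paper:

```python
import numpy as np

def jaccard_loss(y_pred, y_true):
    """1 - |intersection| / |union|, with zero loss when both sets are empty."""
    union = np.sum(y_pred | y_true)
    return 0.0 if union == 0 else 1.0 - np.sum(y_pred & y_true) / union

def tune_threshold(P_train, Y_train, grid=np.linspace(0.0, 1.0, 101)):
    """Pick one global threshold t that minimizes the empirical Jaccard
    loss on training data; every instance is then predicted as p >= t."""
    risks = [np.mean([jaccard_loss(p >= t, y)
                      for p, y in zip(P_train, Y_train)])
             for t in grid]
    return grid[int(np.argmin(risks))]

# P_train: per-instance marginal probability vectors; Y_train: true labels.
P_train = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.6]])
Y_train = np.array([[1, 0, 1], [0, 1, 1]], dtype=bool)
print(tune_threshold(P_train, Y_train))
```

Unlike the instance-wise optimal rule, this fixes a single threshold for all instances, which presumably is why it is not optimal in theory even if it performs well empirically.

=====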