Paper ID: 698
Title: Extreme F-measure Maximization using Sparse Probability Estimates

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies empirical utility maximization (EUM)-based approaches to maximizing the macro-F measure in multi-label learning tasks. The EUM approach computes class probability estimates (CPEs) for each label and then tunes an optimal threshold per label. The idea of the paper is to obtain these thresholds by processing only a subset of the class probability estimates, thus reducing processing time. The first two proposed approaches, STO and FTA, do this naively by looking only at CPEs that exceed thresholds fixed a priori from trivial upper and lower bounds on the values the optimal threshold can take. The third approach, OFO, uses the online threshold tuning strategy proposed by Busa-Fekete et al. The authors then turn to tree-based techniques for obtaining the sparse subset of estimates efficiently: they implement a tree-based classifier that uses linear thresholding functions as the splitting criterion at each non-leaf node, with a logistic regression-based framework to derive CP estimates from it.

Clarity - Justification:
The paper is accessible to readers well versed in the terminology of the problem domain; for a lay reader it may be a difficult read. The learning algorithms could be explained with more clarity.

Significance - Justification:
The main contribution of the paper appears to be the construction of the probabilistic label trees, which are derived from existing approaches. The EUM techniques applied on top of them are not very interesting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1) A naive application of EUM would require only n CPEs to be stored at any point in time, not n*m as claimed in the paper.
This is because, as the authors themselves note, with the macro-F measure one enjoys the freedom of tuning the thresholds independently for the different labels.

2) There is recent work that achieves better accuracies than FastXML on several of the datasets used in the paper (see reference below). This method also has a much smaller memory footprint than FastXML.
Bhatia et al., Sparse Local Embeddings for Extreme Multi-label Classification. NIPS 2015.

3) There is also recent work on online optimization of the F-measure that achieves stronger (finite-sample) guarantees than the work of Busa-Fekete et al.
Narasimhan et al., Optimizing Non-decomposable Performance Measures: A Tale of Two Classes. ICML 2015.

4) Why are no test times reported? This is very important, since the prediction strategy on the PLTs essentially performs a branch-and-bound operation on a fairly large tree and may end up taking close to O(m) time per data point, which is trivial.

5) The authors do present results in terms of predicted positives and claim that this is a valid indicator of testing time. This is not at all clear, and a sound justification is needed. Since the overall prediction method predicts the labels whose CPEs exceed certain thresholds, it will take at least \Omega(m) time simply to check whether the CPEs for the different labels exceed their respective thresholds. This is very different from the FastXML prediction strategy, where no such verification is needed, nor any branch and bound.

6) Without actual wall-clock test times and comparisons with other methods, it is difficult to judge the method properly.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors investigate the problem of F-measure optimization in the extreme classification context. To mitigate the computational cost, they consider models that generate sparse probability estimates (SPEs).
Three previously proposed optimization techniques are composed with two previously proposed techniques for SPE classification, and the resulting combinations are investigated experimentally.

Clarity - Justification:
Very well written and self-contained.

Significance - Justification:
The excellent relative performance of OFO is likely to be (beneficially) surprising to the community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
It is unclear to this reviewer whether FTA, as reported in the experimental section, uses the same threshold for all labels.

Figures 1 and 2 are illegibly small. Perhaps include just one (larger!) pair for a representative data set in the submitted version and place the other (larger!) figures in the supplementary material.

Since computation is important in extreme classification, it would be helpful to know the relative timings, especially between FTA and OFO; the intuition is that OFO is much faster, which, coupled with its competitive statistical performance, would be compelling.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This work proposes a novel, consistent tree-based approach to sparse probability estimation in the context of extreme classification, motivated mainly by maximization of the F-measure. The probabilistic label tree (PLT) trains a class probability estimator at each node, which predicts, for a given example, whether a relevant label corresponds to a leaf of the subtree rooted at the current node. The prediction procedure is a breadth-first tree traversal that may expand several subtrees.

Clarity - Justification:
The presentation is clear, and the relevant details regarding the experiments are provided.

Significance - Justification:
To the best of my knowledge, this is the first work tackling F-measure maximization in the context of extreme classification.

Detailed comments.
(Explain the basis for your ratings while providing constructive feedback.):
This is a strong paper. Extreme classification methods are often evaluated using rank-based measures such as Precision@K. Such performance measures favor the popular labels, which represent only a small percentage of the total number of labels. On the other hand, more interesting performance measures such as the macro F-measure are difficult to optimize due to the size of the output space. This paper fills that gap by introducing a new approach for efficiently producing sparse probability estimates. The proposed method (based on conditional probability trees) is consistent and is extensively validated on large-scale benchmark datasets. Several insights, such as a comparison of OFO, FTA and STO using the same sparse probability estimates, are provided, as well as a sensitivity analysis with respect to the initial parameters.

My main concern is the following: the tree structure is randomly generated (unlike HOMER), therefore the internal binary problems can be arbitrarily complicated (large Bayes error), which affects the quality of the probability estimates and can hurt the final performance. Did the authors try to learn the tree?

=====
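[Editor's note] Reviews #1 and #2 both discuss the OFO online threshold tuning strategy and the observation that, under the macro-F measure, thresholds can be tuned independently per label. The following is a minimal illustrative sketch of that idea, assuming the online update rule tau = a/b of Busa-Fekete et al. (where a accumulates true positives and b accumulates predicted plus actual positives, so tau tracks half the online F1 score); it is not the paper's code, and all function and variable names are hypothetical.

```python
# Hypothetical sketch of per-label online threshold tuning (OFO-style).
# Because macro-F averages label-wise F-scores, each label's threshold
# is tuned independently: only one threshold and two counters per label
# are kept, i.e. O(n_labels) memory rather than O(n * m) stored CPEs.

def tune_thresholds_online(cpe_stream, n_labels):
    """cpe_stream yields (probs, truths): per-label class probability
    estimates and binary ground-truth labels for one example."""
    a = [0] * n_labels      # running true-positive counts per label
    b = [0] * n_labels      # running (predicted + actual) positive counts
    tau = [0.0] * n_labels  # current per-label thresholds
    for probs, truths in cpe_stream:
        for j in range(n_labels):
            pred = 1 if probs[j] > tau[j] else 0  # predict with current tau
            a[j] += truths[j] * pred
            b[j] += truths[j] + pred
            if b[j] > 0:
                tau[j] = a[j] / b[j]  # OFO update: tau equals online F1 / 2
    return tau

# Toy usage: two labels, two examples.
stream = [([0.9, 0.1], [1, 0]), ([0.8, 0.7], [1, 1])]
thresholds = tune_thresholds_online(stream, 2)
```

Note that since F1 is at most 1, every threshold produced this way lies in [0, 0.5], matching the trivial upper bound on the optimal threshold mentioned in Review #1's summary.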