We thank the reviewers for their positive feedback. Below we address the main critical remarks raised in the reviews.

Rev 1: Regarding FTA: we use the same threshold for all labels. This is the simplest, but also a very robust, tuning method. We will defer some figures to the Appendix and increase the size of the remaining ones. Regarding the running times of the tuning methods: thanks to sparse probability estimators, the tuning times are two orders of magnitude lower than the training times of PLT and FastXML. There is almost no difference between the running times of FTA and OFO; STO is slightly slower due to the sorting of the posteriors. We will add this discussion to the paper.

Rev 2: We thank you for appreciating our work. Regarding learning the tree structure: in the experiments we use a random tree structure, so we believe there is still room for improvement. Learning the tree structure is indeed an important and challenging issue, and we are currently working on it. Ideally, we would like to find an online algorithm that produces trees that are competitive in terms of both predictive performance and test-time complexity (e.g., using the idea of Huffman trees).

Rev 3: Regarding clarity: we will improve the clarity and presentation of the paper. Regarding the main contribution: we strongly believe that the paper gives the right guidelines for maximizing the F-score and other complex performance measures in extreme classification settings. The computational performance of the approach introduced in our paper far exceeds that of existing methods. Despite being very simple, the idea of sparse probability estimators significantly changes the way we can solve extreme classification problems under complex performance measures. In addition, we introduce probabilistic label trees (PLTs), which can be trained online; in combination with OFO, this yields a very powerful and purely online tool for extreme classification.

1) We agree with the reviewer; this is why we write in line 127 that a naive application of EUM has to compute and *possibly* store n*m predictions. We will make this point clearer.

2) We are aware of SLEEC and other recent approaches to extreme classification. We use PLT and FastXML as two exemplary methods in our framework. SLEEC could also be used, but it needs more time than FastXML to test new examples. We plan to evaluate it and other extreme learners within the introduced framework in the extended version of the paper.

3) We are also aware of the STAMP method. We do not see, however, how it could easily scale to extreme classification or be used in our framework. STAMP is devised for binary tasks and is therefore unable to handle large label spaces, where one-vs-all training should be avoided.

4,5,6) We did not initially report wall-clock times, as FastXML had huge I/O overheads. Excluding these overheads, both algorithms perform similarly. However, a fair comparison is more complex than it seems (different implementations, Java vs. C++, L_1 vs. L_2 regularization). PLT indeed uses a branch-and-bound strategy, which might in the worst case lead to O(m) test-time complexity. In practice, however, it cuts many subtrees during the search, which results in a significant speed-up over 1-vs-all; a sketch of this pruning is given below.
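To make the pruning concrete, here is a minimal sketch of threshold-based PLT inference. The node interface (`children`, `label`, `eta`) is an illustrative assumption, not our actual implementation; since path probabilities are non-increasing from the root, any node whose path probability falls below the threshold can be discarded together with its entire subtree.

```python
def plt_predict(root, x, tau):
    """Return labels whose estimated marginal probability is >= tau.

    root: tree node with `children` (list of nodes), `label` (set at leaves),
          and `eta(x)` returning the node classifier's P(node | parent, x).
    tau:  decision threshold (e.g., tuned by FTA or OFO).
    """
    positives = []
    stack = [(root, 1.0)]
    while stack:
        node, p_parent = stack.pop()
        p = p_parent * node.eta(x)  # path probability so far
        if p < tau:
            continue  # prune: probabilities only decrease down the path
        if not node.children:       # leaf: one label per leaf
            positives.append(node.label)
        else:
            stack.extend((child, p) for child in node.children)
    return positives
```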
In FastXML, in turn, the test-time complexity depends on hyperparameters such as the stopping criterion used in the tree-learning procedure and the number of trees in the ensemble. Interestingly, for some of the datasets (AmazonCat, RCV1), the trees produced by FastXML are larger than those used by PLT. We agree that the number of predicted positives is not the best surrogate for test-time complexity (it only gives a lower bound on the number of inner products computed in the nodes of the PLT). Below we report the average per-instance wall-clock test times and the average per-instance numbers of inner products computed by PLT and FastXML on the test sets. We will add these results to the revised version of the paper.

Data set  | Test time: PLT FTA, PLT STO, PLT OFO, FastXML | # inner products: PLT FTA, PLT STO, PLT OFO, FastXML
RCV1      | 0.33, 0.78, 0.19, 0.96 | 252, 354, 212, 747
AmazonCat | 0.37, 2.23, 0.85, 1.14 | 225, 2272, 482, 871
wiki10    | 1.69, 2.18, 1.87, 3.00 | 534, 13682, 2518, 541
LSHTC     | 0.22, 0.71, 0.55, 1.17 | 175, 28448, 2875, 900
Amazon    | 0.29, 1.45, 0.22, 1.39 | 132, 5557, 125, 796

The reported times do not include I/O overheads. The running times of PLT depend on the thresholds and are therefore reported separately for each tuning method (a sketch of the OFO update follows below). The results show that PLT with OFO and PLT with FTA are faster than FastXML.
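For completeness, a minimal sketch of a per-label OFO threshold update, which keeps the threshold at half the running F-measure; the class name, interface, and smoothing initialization are illustrative assumptions, not our exact implementation:

```python
# Sketch of a per-label OFO update: the threshold tau = a/b equals half the
# running F-measure 2a/b, where a counts true positives and b counts actual
# plus predicted positives. Names and smoothing constants are assumptions.

class OFOThreshold:
    def __init__(self):
        self.a = 1  # smoothed true-positive count (avoids division by zero)
        self.b = 2  # smoothed count of actual + predicted positives

    def tau(self):
        return self.a / self.b  # current threshold = F / 2

    def update(self, eta, y):
        """eta: estimated P(label | x); y: true binary relevance (0/1)."""
        y_hat = int(eta >= self.tau())  # threshold the posterior
        self.a += y * y_hat             # true positive
        self.b += y + y_hat             # actual + predicted positive
        return y_hat
```

Note that the update only needs the sparse set of labels whose estimated probabilities exceed the current thresholds, plus the true labels of the instance, which is why tuning is cheap when combined with sparse probability estimators.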