We thank the reviewers for their valuable comments and literature suggestions, which will help us improve the paper. Below, we answer the main points raised by the reviewers.

(1)---- Performance on rare labels
ADIOS achieves better results on rare labels. Examples of such labels from the Delicious dataset that are better predicted by ADIOS include: 'legal', 'bills', 'retail', 'academia', 'awards', 'trading', 'cities', 'beauty', 'word'; these labels are general/abstract concepts. Further, the performance of ADIOS and MLP on 100 rare labels (with < 50 examples/label) is:
---------------------------------------------------------
          |  maF1 |  miF1 |  P@1  |  P@5  | P@10
---------------------------------------------------------
MLP       |  7.03 | 11.22 |  5.34 |  2.60 | 1.75
ADIOS_MBC |  8.26 | 12.53 |  5.89 |  2.83 | 1.94
---------------------------------------------------------
Assessing performance as a function of label frequency is a great point; we will incorporate this analysis into the final version of the paper.

(2)---- ADIOS_RND vs. MLP
For most datasets, the best label partitions assigned 25% to 50% of the labels to G1 (the optimal sizes of G1 were determined via cross-validation). Consequently, despite the randomness of the partition, chances are high that some of the labels in G1 will be predictive of those in G2. Following the suggestion of Assigned_Reviewer_7, we ran experiments on Delicious using ADIOS_RND and MLP models having approximately the same number of parameters: an MLP with a hidden layer of 560 units vs. ADIOS with an intermediate layer of 400 hidden units and |G1| = 260 (i.e., ~26%). Below are the results.
---------------------------------------------------------
          |  maF1 |  miF1 |  P@1  |  P@5  | P@10
---------------------------------------------------------
MLP       | 14.25 | 37.72 | 67.46 | 57.79 | 48.74
ADIOS_RND | 14.83 | 37.81 | 67.37 | 58.07 | 49.11
---------------------------------------------------------
These results confirm that the two levels of supervision can help the model even when the partitions are randomly generated.

(3)---- More than 2 output layers
It is possible to use more than two output layers by recursively applying Algorithm 1 to the last subset of labels (i.e., G2). Note that for such a strategy to be successful, the average number of labels per instance must be large enough. This ensures a large label cardinality in each output layer, and that the label vector restricted to each output layer has non-zero components.

(4)---- Number of most correlated labels (parameter k)
In all the experiments reported in the paper, we used k = 2 most correlated labels. Indeed, in all these problems the output space is very sparse, and we did not observe a significant difference between k = 2 and k = 3 in preliminary tests.

(5)---- Meaning of "update C_i" in Algorithm 1
After every step (when one label \ell* is moved from G1 to G2), the set of k most correlated labels (C_i) for each label \ell_i remaining in G1 must be updated if it has changed. That is, if \ell* is contained in C_i, the set of k most correlated labels to \ell_i still present in G1 is recomputed.
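For concreteness, here is a minimal Python sketch of this update step. It assumes a precomputed pairwise label-correlation matrix corr; the function and variable names are illustrative, not from the paper.

    def update_correlated_sets(corr, G1, C, ell_star, k=2):
        """Move label ell_star from G1 to G2 and refresh each C_i.

        corr : pairwise label-correlation matrix, corr[i][j] (illustrative).
        G1   : set of label indices currently in G1.
        C    : dict mapping each label i in G1 to its set of k most
               correlated labels still present in G1.
        """
        G1.discard(ell_star)
        for i in G1:
            # C_i only changes if it contained the removed label.
            if ell_star in C[i]:
                candidates = [j for j in G1 if j != i]
                # Recompute the k labels in G1 most correlated with i.
                candidates.sort(key=lambda j: corr[i][j], reverse=True)
                C[i] = set(candidates[:k])
        return G1, C

Since C_i is recomputed only when it contained \ell*, each step touches few labels, which keeps the update inexpensive.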
(6)---- Sensitivity to the size of G1
ADIOS seems to be insensitive to the size of G1, once it contains enough "predictive" labels. However, it is necessary to perform a small grid search to find the best size for G1. For every candidate partition L = (G1, G2), this can be done efficiently by comparing the performance of a linear model trained to predict G2 using G1 as features versus one trained on the original features X (a minimal sketch of this check is given at the end of this response). Below is the performance of ADIOS on Delicious for various sizes of G1. A chart providing a more complete picture will be included in the final version.
---------------------------------------------------
         |  maF1 |  miF1 |  P@1  |  P@5  | P@10
---------------------------------------------------
G1 = 25% | 17.43 | 39.47 | 69.50 | 58.95 | 49.70
G1 = 50% | 14.79 | 39.15 | 68.06 | 58.37 | 49.15
G1 = 75% | 14.28 | 38.90 | 68.03 | 58.09 | 48.89
---------------------------------------------------
Our best performance on this dataset (see Table 4) was achieved with G1 containing 27% of the labels. Note that the bigger G1 is, the closer the results are to those of the MLP.

(7)---- Relation to RNNs
There is indeed a relation to RNNs, in that we factor the joint probability of the output variables, e.g., P(a, b | x) = P(a | b, x) * P(b | x). The difference is that RNNs share parameters across layers, whereas ADIOS does not, and RNNs push this factorization to its limit, one class at a time, whereas ADIOS stops at two groups of labels.
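To make the contrast in point (7) concrete, below is a minimal PyTorch-style sketch of the two-level factorization, in which the G2 predictions are conditioned on the G1 predictions. The exact wiring (layer sizes, G2 seeing both the hidden state and the G1 outputs) is an illustrative simplification, not the precise architecture from the paper.

    import torch
    import torch.nn as nn

    class TwoLevelOutput(nn.Module):
        def __init__(self, d_in, d_hidden, n_g1, n_g2):
            super().__init__()
            self.hidden = nn.Linear(d_in, d_hidden)
            self.out_g1 = nn.Linear(d_hidden, n_g1)
            # G2 is predicted from the hidden state and the G1 predictions,
            # mirroring the factorization P(G2 | G1, x) * P(G1 | x).
            self.out_g2 = nn.Linear(d_hidden + n_g1, n_g2)

        def forward(self, x):
            h = torch.relu(self.hidden(x))
            y1 = torch.sigmoid(self.out_g1(h))
            y2 = torch.sigmoid(self.out_g2(torch.cat([h, y1], dim=1)))
            return y1, y2

Unlike an RNN, the two levels here have separate parameters, and the factorization stops after two groups instead of proceeding one label at a time.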
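Returning to the partition check in point (6): here is a minimal scikit-learn sketch, where the choice of one-vs-rest logistic regression as the linear model and micro-F1 as the comparison metric is ours, for illustration only.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.multiclass import OneVsRestClassifier

    def partition_gain(X_tr, Y_tr, X_va, Y_va, g1, g2):
        """Compare predicting the G2 labels from the G1 labels
        versus predicting them from the original features X.
        Assumes every G2 label occurs in the training split."""
        # Linear model using the G1 labels as features.
        from_g1 = OneVsRestClassifier(LogisticRegression(max_iter=1000))
        from_g1.fit(Y_tr[:, g1], Y_tr[:, g2])
        f1_g1 = f1_score(Y_va[:, g2], from_g1.predict(Y_va[:, g1]),
                         average='micro')

        # Linear model using the original features.
        from_x = OneVsRestClassifier(LogisticRegression(max_iter=1000))
        from_x.fit(X_tr, Y_tr[:, g2])
        f1_x = f1_score(Y_va[:, g2], from_x.predict(X_va),
                        average='micro')

        # Positive gain: G1 is more predictive of G2 than X is.
        return f1_g1 - f1_x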