Assigned_Reviewer_5 ============================

The reviewer's main claim is irrelevant, misleading, and plainly wrong. We earnestly request that the other reviewers revise their scores appropriately if this review swayed their judgement.

a) "..points having high gradients (or close to hyperplane) .."

Distance to the decision boundary is NOT the same as the gradient (or gradient norm). Here is an example to demonstrate it: at some iteration, two training examples A and B are at the same distance from the decision boundary, but the former is correctly predicted and the latter is incorrectly predicted. Even though they are at the same distance from the boundary, the gradient norm of A is lower than the gradient norm of B, so active learning picks A and B with the same probability while adaptive sampling prefers B over A. (A numeric sketch of this point appears under "Additional sketches" at the end of this response.) Further:
- Active learning does not have label information; adaptive sampling has access to label information.
- Active learning (in the typical case) will not pick an example twice; adaptive sampling can pick an example multiple times as long as it contributes to the loss.
We kindly request the reviewer to retract this claim and to consider reducing the confidence score.

b) "comparison should be on time and accuracy"

As we stated in the paper (lines 612-615), there was no noticeable difference in training time: the whole training took several minutes, while the additional computation was O(milliseconds).

Assigned_Reviewer_1 ============================

".. there is little evidence to justify this assumption.."
".. demonstration of utility of side information is not included in the experimental section"

We would like to point out that we report the standard metrics the optimization community tracks (convergence in objective, test error, etc.), and we find better performance. Please note that we plotted the graphs on a linear scale, not a log scale; our results would look much better on a log scale, but we think that presentation is misleading to readers.

Assigned_Reviewer_4 ============================

"Given this known technique, what the authors describe is pretty minor..."

It is a little unfortunate that we could not fully present our experiments within the ICML page limit. As we stated in lines 599-605, we tried several simpler baselines (including the artificial balancing of the loss that the reviewer proposes) and did not find them to be better than the standard baseline; we would be very happy to include these results in the appendix. Also, please note that under the one-versus-rest approach, each individual binary classification task is unbalanced (~100 positive vs. ~99,900 negative examples).

We thank the reviewer for the comments on the related work section; we will revise it appropriately. We sincerely hope these minor revisions will not affect acceptance.

Assigned_Reviewer_6 ============================

1. "I do not see why does the inequality hold true"

Here are some more details on why line 382 is true. Note that 0 <= n_k <= 1 and \sum_k n_k = 1, which makes {n_k} a probability distribution. We can then simply use the fact that (E[\sqrt{X}])^2 <= E[X], where the expectation E is computed over the distribution {n_k} and the random variable X takes the values {s_1, s_2, ..., s_k, ...}; this makes the LHS v_p and the RHS v_u. (A spelled-out derivation appears under "Additional sketches" at the end of this response.)

2. "empirical performance is impressive and substantial enough to warrant acceptance."

We thank the reviewer for the positive comment. Given the quality of this review compared to some others (e.g., Reviewer 5's), we request the reviewer to consider raising the confidence of the review and/or upgrading the score if the reviewer believes the paper warrants acceptance.
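
Additional sketches ============================

To make point (a) of our response to Reviewer 5 concrete, here is a minimal numeric sketch for binary logistic regression. The weight vector, feature vector, and labels below are illustrative values chosen for this response, not quantities from the paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_grad_norm(w, x, y):
        # The gradient of log(1 + exp(-y * w.x)) w.r.t. w is
        # -y * sigmoid(-y * w.x) * x, so its norm is ||x|| * sigmoid(-y * w.x).
        return np.linalg.norm(x) * sigmoid(-y * np.dot(w, x))

    w = np.array([1.0, 0.0])   # decision boundary: the line x1 = 0
    x = np.array([0.5, 0.3])   # the same feature vector for both A and B

    # A and B sit at the same distance |w.x| / ||w|| = 0.5 from the boundary,
    # but A is correctly predicted (y = +1) and B is not (y = -1).
    print(logistic_grad_norm(w, x, +1))   # A: ~0.22 (low gradient norm)
    print(logistic_grad_norm(w, x, -1))   # B: ~0.36 (high gradient norm)

Same distance to the hyperplane, different gradient norms: a distance-based active learner treats A and B identically, while loss-aware adaptive sampling prefers B.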
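
The second contrast (picking each example at most once versus sampling with replacement) can be sketched the same way; the per-example losses below are made-up values, used only to show that a high-loss example can be drawn repeatedly.

    import numpy as np

    rng = np.random.default_rng(0)
    losses = np.array([0.05, 0.10, 2.00, 0.08])   # made-up per-example losses

    # Adaptive sampling: draw WITH replacement, proportionally to the loss
    # (which requires labels), so the high-loss example (index 2) can be
    # picked many times.
    p = losses / losses.sum()
    print(rng.choice(len(losses), size=8, p=p))

A typical active learner, by contrast, has no labels (hence no losses) and queries each point at most once.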
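
Finally, for Reviewer 6's point 1, here is the derivation written out, assuming (as in our response above) that v_p = (\sum_k n_k \sqrt{s_k})^2 and v_u = \sum_k n_k s_k:

    % Jensen's inequality for the concave map t -> sqrt(t) over the
    % distribution {n_k}; equivalently, Var(sqrt(X)) >= 0.
    \[
      0 \le n_k \le 1, \quad \sum_k n_k = 1
      \;\Longrightarrow\;
      v_p = \Big(\sum_k n_k \sqrt{s_k}\Big)^2
          = \big(\mathbb{E}[\sqrt{X}]\big)^2
          \le \mathbb{E}\big[(\sqrt{X})^2\big]
          = \sum_k n_k s_k
          = v_u,
    \]
    where \(X\) takes the value \(s_k\) with probability \(n_k\), and the
    inequality holds because \(\mathrm{Var}(\sqrt{X}) =
    \mathbb{E}[X] - (\mathbb{E}[\sqrt{X}])^2 \ge 0\).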