Paper ID: 788
Title: An optimal algorithm for the Thresholding Bandit Problem

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers a variant of the stochastic bandit problem in which one is not interested in minimizing the cumulative regret by sequentially pulling the arms, but rather in identifying either the set of arms whose means are above some given threshold (TBP), or the set of M arms with highest means (TopM). Two settings are considered: one with a fixed pulling budget, in which case the goal is to output the correct TBP/TopM set with the highest possible probability, and one with a fixed confidence level, in which case the goal is to output the correct TBP/TopM set with as few pulls of the arms as possible. These formulations were previously introduced in the related literature, and the paper focuses on the gap between the known lower bound and upper bound for these problems. Importantly, the authors study the basic notion of complexity H = \sum_i \Delta_i^{-2}, where \Delta_i is the mean gap of arm i with respect to the optimal mean, and provide upper and lower bounds. An algorithm is introduced for the TBP problem, claimed to be novel (but it is not). Section 2 focuses on the TBP problem. Theorem 1 provides a lower bound in the case of Gaussian distributions with the same variance, equal to 1. This lower bound is non-asymptotic. It is extended to a lower bound for R-sub-Gaussian distributions in Corollary 1, for large enough T. A simple algorithm is introduced in Algorithm 1, and an upper bound on its expected loss is given in Theorem 2 and Corollary 2.

Clarity - Justification:
The paper is clear and looks sound. The claim that the results close the gap between the lower and upper bound is misleading though, as this is only the case up to constants, even asymptotically, which shows there is still room for improvement (as was the case for standard bandits).

Significance - Justification:
The results hold for sub-Gaussian distributions, and the complexity term should have been updated for more general classes, which is not discussed. Regarding novelty, the proposed APT algorithm turns out to be the GCL* algorithm already analyzed by Salomon and Audibert, whose paper is moreover not cited in the present work. Regarding significance, the statements are clear and should be valuable for the community interested in the topic.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is clear and easy to follow. I have a few concerns. There is still a gap between the upper and the lower bound on the regret, even asymptotically (the right quantity to look at is then \frac{\log \mathbb{E}[\mathcal{L}(T)]}{T}). Thus the claim that your results close the gap is wrong: the gap is closed up to constants, and for a specific class of distributions only. This should be clarified. By analogy with the cumulative-regret setting in bandits, this gap in terms of constants matters: it is basically the difference between the UCB algorithm of Auer et al. and that of Burnetas et al. (KL-UCB) or TS. UCB is optimal up to constants, but does not close the gap, and is vastly outperformed by the other variants. Algorithm 1 is not new: up to the precision epsilon, it is exactly the GCL* algorithm from "Deviations of stochastic bandit regret", Salomon and Audibert, ALT 2011, which was also analyzed in that paper for the cumulative regret.
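As a minimal sketch of this type of anytime index rule (assuming the sampling index has the form \sqrt{T_k(t)} (|\hat{\mu}_k(t) - \tau| + \epsilon), which is only my reading and may differ in detail from Algorithm 1 of the paper):

    import numpy as np

    def apt_style(pull, K, T, tau, eps=0.0):
        # counts[k] = T_k(t), sums[k] = running sum of the rewards observed from arm k
        counts, sums = np.zeros(K), np.zeros(K)
        for k in range(K):                      # initialisation: pull each arm once
            sums[k] += pull(k); counts[k] += 1
        for _ in range(K, T):
            mu_hat = sums / counts
            index = np.sqrt(counts) * (np.abs(mu_hat - tau) + eps)
            k = int(np.argmin(index))           # pull the arm with the smallest index
            sums[k] += pull(k); counts[k] += 1
        return {k for k in range(K) if sums[k] / counts[k] >= tau}  # estimated TBP set

    # Hypothetical usage with unit-variance Gaussian arms:
    # mu = np.array([0.2, 0.45, 0.55, 0.9]); tau = 0.5
    # apt_style(lambda k: np.random.normal(mu[k], 1.0), K=len(mu), T=5000, tau=tau)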
The GCL* connection is especially relevant to Section 3.2, and Theorem 4 should be compared with their result. Looking at the proof of the lower bound, I do not understand why you restrict to Gaussian distributions. It seems you could derive, at the price of only a small overhead, a more generic lower bound that would hold for all classes of distributions (or at least exponential families of dimension 1). You quickly mention a few possible extensions on page 5, though. I understand that the complexity H is obviously only a proxy for the real complexity of the problem, which should be stated in terms of some information-theoretic gap between the distributions, and which happens to coincide with H in the Gaussian case with fixed variance. Thus, I would have preferred to see a deeper analysis of the problem, starting from deriving the "right" complexity for any class of distributions, together with a clean lower bound.

Decision: Borderline, tending to accept.

Typo, l.304: "of of".

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces the thresholding bandit problem, a new pure-exploration problem in a bandit model in which the goal is to determine, with the highest possible accuracy, the arms whose means are over some threshold. This new setup is motivated by applications to active level-set detection described in Section 3.1. The main contributions of the paper are a lower bound on the expected loss (probability of error) of any algorithm (Theorem 1), and a simple, parameter-free algorithm for which an upper bound on the expected loss is given (Theorem 2). This upper bound matches the lower bound up to constant factors (inside an exponential), which proves that the proposed APT algorithm is (order) optimal. In Section 3.2, the authors also explain that APT can be used for both best-arm identification and regret minimization purposes, provided that the mean of the best arm is known. The paper concludes with experiments assessing the performance of APT, compared in particular to the CSAR algorithm, suited for best-arm identification in combinatorial bandits, given that the thresholding bandit problem can be reduced to this framework provided that the rewards are shifted by the value of the threshold.

Clarity - Justification:
The paper is well-written, and the contributions are well presented and related to existing work. The presentation of the lower bound should be improved. First, it should be said that this lower bound does not cover all possible models: the B^i are particular instances in which there is only one arm above the threshold, or only one arm below it (depending on whether we follow the definition in the statement of Theorem 1 or that in Appendix A, which are different!).
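For concreteness, one natural reading of these alternative instances (an assumption on my part, since the two definitions in the paper do not coincide) is, with unit-variance Gaussian arms:

    \mu_k(B^0) = \tau - \Delta_k \quad \text{for all } k,
    \qquad
    \mu_k(B^i) = \begin{cases} \tau + \Delta_i & \text{if } k = i, \\ \tau - \Delta_k & \text{if } k \neq i, \end{cases}

so that no arm is above the threshold under B^0, while arm i is the only arm above it under B^i.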
There are several issues in the statement of Theorem 1: the index j appears never to be used, an epsilon is missing in the definition of \bar{H} (it should be (d_i + 2\epsilon)^{-2}, if we rely on definition (2)), and the notation \bar{H} is not used in the display on l.319-320. More generally, throughout the paper there are several variants of the quantity H whose names change, which is sometimes misleading. Maybe you could introduce a proper definition environment for H, rather than defining it (twice) in the captions of Table 1 and Table 2.

Significance - Justification:
This paper provides strong theoretical contributions. Moreover, it is very interesting that the proposed algorithm is quite different from other algorithms proposed in the pure-exploration literature: it neither relies on successive eliminations nor on the use of confidence intervals. The algorithm is also well motivated, since several practical problems on which it could be used are described in Section 3, and it reveals an interesting phenomenon: knowing mu^* permits using the same sampling strategy for (order-optimal) best-arm identification and regret minimization, which is not true in the general case. My only concerns are related to the presentation of the lower bound (as mentioned above), and some clarifications I would need in the proofs (see below).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As a general comment, it should be emphasized that we look for order-optimal algorithms, i.e. upper bound and lower bound matching up to some constants inside the exponential (which can actually make a big difference). There are a few statements in the paper that I don't really agree with. First, could you explain why (l.362) a near-optimal static strategy allocates the samples so that the quantities T_k\Delta_k^2 are equal across arms (see the sketch further below)? I don't see how this appears in the proof of Theorem 1 (which is, moreover, not valid for all the possible models). Similarly, could you elaborate more on the sentence "providing optimal strategies in the fixed budget setting is a more difficult problem than in the fixed-confidence setting" (l.550). Finally, in the statement of Theorem 4, rather than putting an optimization problem in the right-hand side, I would simply state it with the specific choice \delta = 1, as in (31).

For the proof of the lower bound, it should be explained why we only need to consider the case \tau = 0 and \epsilon = 0. More explanations should also be given at the end of the proof: I don't really understand what you mean in the last sentence, and how you handle the case P_{B^0}(A) \leq 1/2. Some typos in this proof:
- l.1077: I guess the absolute value should also include KL_k;
- l.1093: the last line is not an equality but an inequality;
- l.1166: the 3 in front of T_i\Delta_i^2 should be a 6 (if I understand correctly what you did, you had to add 2T_i\Delta_i^2 and 4T_i\Delta_i^2).
I would like to see a proof of the Corollary of Theorem 1 (which could be numbered) in the Appendix for completeness.

Regarding the proof of Theorem 2, I don't understand how you obtain the second line of display (12): I think there is a missing H in the left-hand side, since in its current form it seems to rely on the inequality T_k(t) \geq TH/(2\Delta_k^2), which is different from (8). This mistake may have consequences on the rest of the proof, since you may have to choose \delta as a function of H. So please check whether it is correct.
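Returning to the l.362 point on the static allocation: my understanding (an assumption on my part, based on the complexity H = \sum_k \Delta_k^{-2}) is that a near-optimal static strategy allocates T_k proportionally to \Delta_k^{-2}, so that T_k \Delta_k^2 = T/H for every arm. A minimal sketch:

    import numpy as np

    def static_allocation(gaps, T):
        # gaps[k] = Delta_k; allocate T_k proportionally to Delta_k^{-2},
        # so that T_k * Delta_k^2 equals T / H for every arm, with H = sum_k Delta_k^{-2}.
        weights = gaps ** -2.0
        H = weights.sum()
        T_k = T * weights / H
        return T_k, T_k * gaps ** 2        # the second output is constant, equal to T / H

    gaps = np.array([0.1, 0.2, 0.5, 1.0])
    T_k, per_arm_budget = static_allocation(gaps, T=10_000)
    # per_arm_budget is T/H for every arm (up to rounding if T_k is rounded to integers).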
Other minor things: in (1262), I think there must be a missing factor 2 in the denominator, since there should be a price for the use of the peeling (of "size" 2 here). Moreover, in the left-hand side, the supremum over an event is not really meaningful; you should rather write P(\exists v \in [] ; |1/v ... | \geq \sqrt{T\delta^2/Hv}). In (12), the right-hand side should be B_i(t) (similar typo later in (17)). In (24), the inequality should be reversed.

Some comments to conclude:
- l.225: bounded distributions in [0,R] are not R-sub-Gaussian ([0,1] is 1/4-sub-Gaussian);
- l.230: it is weird to use "it" to refer to the learner.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper studies the pure-exploration stochastic bandit problem. The paper considers the TBP and TopM problems, where in TBP the goal is to identify the set of arms whose means are above a threshold, and in TopM the goal is to identify the M arms with the highest means. The authors consider the setting with a fixed budget, where the learner, given a budget of T time-steps, has to return the desired solution with the highest probability. They prove the first lower bound in this space and also introduce a parameter-free algorithm that achieves an almost tight result. The authors also discuss the connections to existing results and compare the empirical performance of the algorithm using experiments.

Clarity - Justification:
The paper is written clearly, and the results are put into perspective by multiple discussions.

Significance - Justification:
The significance of this paper lies, first, in providing a lower bound for TBP, which was not known before, and second, in providing an algorithm with an almost matching bound. Additionally, the proposed algorithm is parameter-free, compared to previous work where knowledge of T, the complexity H, or the sub-Gaussian parameter was necessary.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is a nicely written paper on the topic. The paper introduces the different variations of pure-exploration bandit problems well and clearly points out its contributions. It also explains the heuristics and algorithms well and effectively discusses the related material. I believe both the results and the insights from this paper are of interest to the community.

Minor comments: I find the use of the name "active binary classification" for a setting of learning with experts misleading. This name refers to a problem in active learning where the learner first draws an unlabeled sample set, then actively requests the labels of some of the points in the sample set, and the goal is to output a classifier with small error over the distribution. In that setting, the labels are persistent and re-querying them does not lead to a different label.

Line 303: of of --> of
Line 323: E_{B} --> E_{B^i}
Line 606: an homogenous --> a homogenous

=====