Paper ID: 443
Title: Anytime Exploration for Multi-armed Bandits using Confidence Information

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces the pure-exploration ``anytime Explore-m'' setting, in which at each time step the learner plays one arm, receives a reward for that arm, and recommends the best m arms. The authors propose an algorithm (AT-LUCB) to solve the problem and provide theoretical guarantees on its sample complexity. Moreover, the paper demonstrates the empirical performance of the algorithm compared to a fixed-confidence algorithm adapted to the setting. Overall, the paper is well written, with interesting results.

Clarity - Justification:
The paper is easy to follow, the setting and the algorithm are well described, and the results are easy to identify.

Significance - Justification:
The paper presents a way (different from the doubling trick) to use an existing fixed-confidence algorithm in order to design an anytime algorithm. The authors also provide an analysis of the algorithm.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In the paper, the authors introduce a pure-exploration problem for MAB in which at each time step the learner chooses an arm to play based on the history. After receiving a reward corresponding to the arm, the learner must output a set of the best m arms. Unlike in previous papers, the learner is given neither a budget nor a target probability of misidentification, and has to recommend the top-m arms at every time step. This setting, called anytime Explore-m, is introduced in the paper. The performance of the learner is measured in terms of sample complexity, i.e., the number of samples required to achieve a target misidentification probability. The authors propose an algorithm for this setting, called AT-LUCB, which uses LUCB, a fixed-confidence algorithm, as a subroutine. Theorem 1 gives anytime guarantees for the AT-LUCB algorithm (for any time, the theorem bounds the probability of misidentification). According to the theorem, the algorithm has a better sample complexity than applying the doubling trick to existing algorithms, which is also confirmed empirically in the experimental section of the paper.

Question: The anytime Explore-m setting assumes that the learner picks one arm at each time step, which is not the case for the AT-LUCB algorithm, which picks two arms at every time step. Therefore, the algorithm has access to more observations than it should. Could this be the reason for the better empirical performance of the algorithm? Is it necessary to have both samples at each time step?

The paper contains a few typos and mistakes (e.g., ``max'' instead of ``min'' in line 6 of Algorithm 1).

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
see comments

Clarity - Justification:
see comments

Significance - Justification:
see comments

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors study the anytime explore-m problem, where the bandit learner needs to predict the top-m arms, in addition to pulling one arm, at every time step, without any prior knowledge of the pull budget or the acceptable confidence. They propose an algorithm built on top of a confidence-based algorithm called LUCB. The proposed algorithm adaptively adjusts the confidence parameter to achieve anytime explore-m.
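To make the discussion concrete, here is a minimal sketch of the kind of LUCB-style step the paper builds on (a hypothetical reconstruction from the description above, not the authors' actual pseudocode; the Hoeffding-style confidence radius and its constants are assumptions):

    import math

    def lucb_step(means, counts, t, m, delta):
        """One LUCB-style round: recommend the empirical top-m and
        pull the two most ambiguous arms. means/counts hold the
        empirical means and pull counts per arm; beta is a generic
        Hoeffding-style radius (assumed form, not the paper's)."""
        n = len(means)
        beta = [math.sqrt(math.log(5 * n * t**4 / (4 * delta))
                          / (2 * counts[a])) for a in range(n)]
        # Current empirical top-m set, recommended at every time step.
        top_m = sorted(range(n), key=lambda a: means[a], reverse=True)[:m]
        rest = [a for a in range(n) if a not in top_m]
        # Most pessimistic arm inside the top-m (lowest LCB) ...
        l = min(top_m, key=lambda a: means[a] - beta[a])
        # ... and most optimistic arm outside it (highest UCB).
        u = max(rest, key=lambda a: means[a] + beta[a])
        return top_m, (l, u)   # note: two arms are pulled per round

An anytime wrapper in the spirit of AT-LUCB would then run such steps in stages, starting from a large delta and shrinking it over time; the exact schedule is the paper's contribution and is not reproduced here.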
The algorithm is demonstrated to perform better than a naive extension of a budget-based algorithm, as well as naive sampling. The paper contains heavy theoretical results, which appear reasonable, but I did not verify every line for correctness. Some comments:

* Around line 190, it is said that the second term dominates the max, while in Section 2 it seems that the first term is the dominating one.
* The paper itself comes with neither references nor a conclusion. After checking the supplementary material, it seems that the references are there, but a conclusion is still lacking. It is not clear how the authors can produce a complete paper within the page limit with all the heavy results.
* Given that LUCB is not clearly illustrated, possibly because of the page limit, it is neither clear nor motivated why the authors specifically take LUCB as the base learner. Would the proposed algorithm work with any confidence-based algorithm? Why or why not?
* I am not sure I understand the title's ``using confidence information'', as the confidence part of the algorithm is a parameter that is adaptively adjusted, so I am not sure what ``using'' means. It is also not motivated why using the confidence information should help---the introduction reads more like ``avoiding the challenging task of choosing the parameter B or delta'' rather than ``using'' any information.
* The experiments seem to cover only theoretically justified algorithms for anytime explore-m. But if LUCB were just naively adapted to anytime explore-m by fixing a naive delta and forcing it to output m choices regardless of whether there is sufficient confidence, would it still work in practice? Such a strawman experiment may help clarify the value of the proposed algorithm (see the sketch after this list).
* Minor typo: ``taht'' in line 322.
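The strawman baseline mentioned above could be as simple as the following (purely hypothetical code, reusing lucb_step from the earlier sketch): delta is fixed once and for all, and the empirical top-m is reported at every step regardless of confidence:

    def naive_anytime_lucb(pull, n_arms, m, delta=0.1, horizon=10000):
        """Strawman: fixed-confidence LUCB run with one arbitrary delta,
        forced to recommend m arms at every step, whether or not its
        stopping condition holds (hypothetical baseline, not the paper's)."""
        means = [0.0] * n_arms
        counts = [0] * n_arms
        for a in range(n_arms):        # pull each arm once so the
            means[a] = pull(a)         # confidence radii are defined
            counts[a] = 1
        for t in range(n_arms + 1, horizon + 1):
            top_m, (l, u) = lucb_step(means, counts, t, m, delta)
            yield top_m                # forced recommendation
            for a in (l, u):           # the standard LUCB pulls
                counts[a] += 1
                means[a] += (pull(a) - means[a]) / counts[a]

Comparing such a baseline against AT-LUCB for several values of delta would make the contribution of the adaptive confidence schedule concrete.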
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a novel bandit setup, in particular the anytime version of the explore-m problem. To solve it, the authors transform LUCB (Kalyanakrishnan et al., 2012) into a multi-phase algorithm. Besides this, they also consider some other methods (more straightforward modifications of other bandit algorithms) and compare them to the proposed algorithm on several levels and in various forms (including a careful theoretical analysis and an experimental investigation).

Clarity - Justification:
The paper is very technical, but the authors do a good job of presenting the details in a clear way.

Significance - Justification:
The introduced problem is reasonable and has practical significance (the authors motivate the model by a particular crowdsourcing application).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The results are not especially mind-blowing, but decent. The authors have a hard time showing that the theoretical guarantees are superior to those of the baselines (but they eventually succeed). The experimental results are similar in some sense: AT-LUCB (the proposed method) is sometimes better and sometimes worse than its competitors, but it is always competitive and appears to be the most stable. The solutions are rather technical, and it is not clear whether there is some clear, deep take-home message. Nevertheless, the technicalities are serious, and the discussion is detailed, right to the point, and convinces the reader of the significance of both the problem and the result. All in all, it is a valuable contribution.

Some remarks:
- Lines 190-193 are confusing. The claim is that the second term dominates, but then the first term is used.
- Line 323: ``taht'' should be ``that''.
- Line 329: \Delta_a should be \Delta_a^{<m>}.
- Line 406: 1/n instead of n seems more reasonable.
- Figures: it would be interesting to see the real optimum.

=====