Paper ID: 410
Title: PAC Lower Bounds and Efficient Algorithms for The Max $K$-Armed Bandit Problem

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper treats the Maximum K-armed bandit problem. The problem is similar to a standard multi-armed bandit problem; however, rather than attempting to find the arm with the highest expectation, the goal is to obtain a single sample that is \epsilon-optimal (i.e. at most \epsilon less than the supremum of the support of any arm) with probability 1-\delta. An algorithm that achieves this goal is called (\epsilon,\delta)-correct, and algorithms with low complexity (number of samples required before stopping) are desired. The authors give an index-based policy for achieving this goal, provide a problem-dependent complexity analysis (i.e. give an upper bound on the expected number of samples before the algorithm stops), and also provide a nearly-matching lower bound on the complexity of any algorithm that is (\epsilon,\delta)-correct. The authors discuss some other extensions of their results, including a robustness analysis and a comparison with the Unified-Arm Bandit problem.

Clarity - Justification: The text is easy enough to read. There are a few grammatical errors, but these do not hinder understanding and can be easily corrected. If the paper is likely to be accepted, I shall include a list of them. Unfortunately, I did not find the proofs easy to follow. The paper is very heavy on notation, and there are few explanations of the various quantities and constants used in the proofs. As a result, it is difficult to keep the overall argument of the proofs in mind while reading them. Some paragraph headings could help with this, or some pictures to illustrate all the different quantities.

Significance - Justification: I am not an expert on this field of bandit problems; however, the authors seem to make a significant contribution.
Their lower bound on the complexity of the algorithm employs a standard change-of-measure argument; however, to my knowledge, it has not been achieved before in this setting. Providing an efficient (\epsilon,\delta)-correct algorithm that improves on the performance of other algorithms is also a good contribution. However, I have some concerns about the proof of the lower bound, and hope the authors can convince me of the correctness of their work here.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): After spending some time on this paper, I feel that the material is good and probably correct, and I would like to accept it to the conference. However, I am held back from recommending acceptance by my difficulties in reading the paper. I am looking to improve my overall rating after reading the rebuttal. I am still unsure about one part of the proof of Theorem 1, and this has a major impact on my overall rating for the submission:

- In the proof of Theorem 1, lines 473-479, I do not understand how you cope with k'. I think I follow your argument lower-bounding E_0[T_k] for all k \ne k'. On the other hand, T is the sum of T_k over all k. Hence, you can lower bound in expectation all the summands T_k except for T_{k'}. However, your sum in the display on line 477 is over all k \ne k^*. I do not follow how knowing that t_{k^*} \ge t_{k'} helps you to do this. Can you please explain?

Here are some other things that I found to be unclear:

- line 223: 'therefore tail functions are non-increasing'. This is not true. Using your definition, tail functions are functions of \epsilon, and they increase as \epsilon increases.
- lines 233-245: Although some intuition is given about Assumption 1, there is very little. Some example cases would be extremely helpful here, and I would have greatly appreciated a discussion about the role of \epsilon_0 (What limits the choice of this parameter?
Is there a trade-off between \epsilon_0 and the final form of G_*?).
- lines 322-350: The notation for the new modified CDFs is very heavy and confusing. Could you illustrate this with a figure? In particular, showing where all the different \mu constants lie in the two cases, and giving some idea of the different shapes of the modified CDFs, would be most helpful.
- lines 382-383: This is incorrect, although it does not affect the proof: you have forgotten that it could also be the case that P(B^C) > 1/2 for all k.
- line 396: There is an extra 't' in the definition of D_k which, I think, should not be there.
- line 437: Please put the definition of f_k closer to its use. The sentence that describes it is not very clear, and the choice of letter is unfortunate (I would expect a small f_k to represent a pdf, especially since F_k represents a cdf).
- line 440: This sentence is not clear at all. There should be a colon immediately after 'before'.
- line 477: What is this expectation over? Is it actually E_0?
- line 538: It is quite unclear what the logarithmic term is. On close inspection of L one finds that it is a term logarithmic in \epsilon. Please make this more explicit.
- I have not checked the proof of Theorem 2 as thoroughly, but it also suffers from having rather a lot of notation.
- lines 804-844: There is no mention of \epsilon_0 in this illustrative section. I suppose this is because it can be set to be larger than the minimum separation gap, but some discussion would be helpful here.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): Studies the max-bandit problem under a tail assumption and shows upper and lower bounds that are novel. The authors also consider the performance of pooling the arms into a single arm and sampling from that, a notable comparison since it requires no assumptions to use.

Clarity - Justification: Some minor English typos, but nothing that prevented comprehension.
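For context, the pooled ("unified-arm") baseline mentioned in this summary can be sketched as follows. This is an illustrative reconstruction, not the paper's algorithm: the function name and the parameter `tail_lb` (an assumed known lower bound on the probability that one pooled sample lands within \epsilon of the overall supremum \mu^*) are hypothetical.

```python
import math
import random

def unified_arm_baseline(arms, delta, tail_lb):
    """Pooled / unified-arm baseline (illustrative sketch).

    Each round draws one sample from an arm chosen uniformly at random
    and keeps the running maximum.  If each pooled sample exceeds
    mu^* - eps with probability at least tail_lb =: p, then after n
    samples the failure probability is (1 - p)**n <= exp(-p * n),
    which is at most delta once n >= log(1/delta) / p.
    """
    n = math.ceil(math.log(1.0 / delta) / tail_lb)
    best = max(random.choice(arms)() for _ in range(n))
    return best, n
```

For instance, with two arms uniform on [0, 1] and [0, 0.5], only the first arm can exceed 1 - \epsilon (for \epsilon < 0.5), so the pooled tail probability at \epsilon = 0.1 is 0.5 * 0.1 = 0.05, giving n = ceil(log(1/\delta)/0.05) samples. The point of the comparison in the review is that this bound needs no per-arm tail assumption beyond the pooled one.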
Significance - Justification: The fact that the tail assumption must be supplied as input, with no obvious way to estimate it without extensive samples, is troubling. The robustness result only holds for a constant difference, not the exponent, so it leaves much to be desired.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): This paper establishes new results for the Max K-armed bandit problem in the pure exploration setting. The paper offers matching upper and lower bounds (up to logarithmic factors) for the PAC setting, under an assumption regarding the tail distributions of the rewards. The proofs of both the upper and lower bounds are correct, and follow standard arguments from cited sources. Prose and notation are adequately clear throughout. It is a nice math problem, but the motivational examples seem a bit far-fetched. The population and chemical examples seem the most reasonable, but it is unclear how one could guess G^* in these cases.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents upper and lower bounds on the sample complexity of algorithms that find, with high probability, a reward within epsilon of the best possible reward in a bandit model in which each arm has bounded support. A non-parametric framework is considered, in which there is a known lower bound, G^*, on the CDF in an epsilon_0-neighborhood of the maximum of each arm. The results are expressed in terms of this function G^*. The proposed algorithm, Max-CB, is a variant of UCB modified to tackle this Max-K bandit problem, and its sample complexity is proved to match (up to multiplicative constants) the lower bound given in Theorem 1.
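The UCB-style structure this summary describes might be sketched generically as follows. To be clear, this is not the paper's Max-CB: the power-law tail bound (exponent `beta`, constant `C`), the confidence-width formula, and the stopping rule are all hypothetical choices made for illustration only.

```python
import math
import random

def optimistic_max_bandit(arms, eps, delta, beta=1.0, C=1.0, horizon=20000):
    """Generic optimistic index policy for the max K-armed bandit
    (an illustrative sketch, not the paper's Max-CB).

    Hypothetical tail assumption: each arm k with supremum m_k satisfies
    P(X_k > m_k - x) >= C * x**beta for small x > 0.  Then after n pulls,
    the best observation is within (log(1/gamma) / (C * n))**(1/beta)
    of m_k with probability at least 1 - gamma, since
    (1 - C x**beta)**n <= exp(-C * x**beta * n).
    """
    K = len(arms)
    counts = [0] * K
    best = [-math.inf] * K
    gamma = delta / (2 * K * horizon)  # crude union bound over arms and rounds

    def bonus(n):
        # width such that m_k <= best[k] + bonus(n) w.p. >= 1 - gamma
        return (math.log(1.0 / gamma) / (C * n)) ** (1.0 / beta)

    for _ in range(horizon):
        # optimistic index: highest plausible supremum of each arm
        k = max(range(K),
                key=lambda i: math.inf if counts[i] == 0
                else best[i] + bonus(counts[i]))
        x = arms[k]()
        counts[k] += 1
        best[k] = max(best[k], x)
        # stop once the leading arm's supremum is pinned down to eps/2
        if bonus(counts[k]) <= eps / 2:
            break
    return max(best), sum(counts)
```

On an easy instance (one arm uniform on [0, 1], one on [0, 0.5], so beta = 1), the policy quickly abandons the dominated arm once its optimistic index falls below the leader's, which is the qualitative behavior one would want any index policy for this problem to exhibit.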
Section 5 then studies what happens when we know the lower bound up to a multiplicative constant, and Section 6 studies the optimal sample complexity of algorithms that sample the arms uniformly at random, showing in particular that for some specific problem instances, there is nothing to gain by departing from uniform sampling.

Clarity - Justification: The paper is well-written, and the contributions are clearly stated. The proofs are sometimes a bit difficult to follow, due to long sentences with strange punctuation (examples: 416-419, 440-444).

Significance - Justification: This paper provides the first sample complexity lower bound for the Max-K bandit problem, which is very interesting. Yet, as explained below, some elements in its proof need to be clarified. Moreover, a matching algorithm is provided. However, it is not explained in the paper whether this algorithm shares similarities with existing algorithms in the literature (discussed in the first paragraph of p. 2, but never really presented). In particular, the paper would greatly benefit from some numerical experiments, including state-of-the-art algorithms, in place of the upper-bound comparison (with the uniform benchmark only) given in Example 1 and Example 2.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): As explained above, my main concern is that the algorithm is not really compared to state-of-the-art algorithms, which are not really presented. The scaling of the associated bounds is discussed in the introduction, but it would be nice to have an empirical comparison as well.

The proof of Theorem 1 relies on a change of distribution argument inspired by [Mannor and Tsitsiklis 04]. For each arm, one needs in particular to construct an alternative model H_k in which the support of arm k is strictly larger (mu^* + epsilon) than the support of the best arm in the original model (mu^*).
Under H_0 and under H_k, the distribution of arm k does not have the same support, which creates an extra difficulty: P_k^H is not absolutely continuous with respect to P_0^H (which has a smaller support for arm k), and the Radon-Nikodym derivative dP_k^H / dP_0^H that appears in the display l. 438 cannot be defined. However, I agree (in spirit) that, on the event C_k (where no observations outside the support of P_0 are made), one should be able to write something like that, but it is not clear to me how. Also, how to obtain this equation should be clarified. It should be clearly mentioned that f_k(h) is a definition, and that (1-\gamma_k/F_k(\bar{\mu}_k)) comes from a ratio of densities (do we need to assume that F_k(mu) has a density to write this?).

Finally, a last issue in the proof of Theorem 1 is that when reading it without looking at the appendix, one believes that the only use of the concavity of G_* is at the end of the proof, when we want to get the max(epsilon, mu^* - mu^*_k) inside the argument; but it should be mentioned that Lemma 1 also makes use of the concavity, and that one could not obtain the display l. 477 without it.

The proof of Theorem 2 relies on (5), which gives a first (deterministic) upper bound (L''^2) on the number of samples used by the algorithm (then the rest of the analysis provides a better bound on some event E_1 that has high probability, and uses that E[T \ind_{E_1^c}] is then upper bounded by L''^2 P(E_1^c) \leq 1). However, I am not convinced that (5) trivially holds for all arms k. It holds by definition for the (random) arm k^*_T that makes the algorithm stop, but this arm is defined as argmax_k Y_{C^T(k)}^k and not as argmax_k U(C^T(k)).

If these two concerns on the proofs of Theorem 1 and Theorem 2 are clarified in the rebuttal, I will change my rating to weak accept.
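For readers following this discussion, the generic change-of-measure step in the style of [Mannor and Tsitsiklis 04] that both this review and Review #1 refer to can be written schematically as follows (generic notation, not the paper's; the constants are only indicative):
\[
\mathbb{P}_{H_k}(A) \;=\; \mathbb{E}_{H_0}\!\left[ \mathbf{1}_A \, \frac{d\mathbb{P}_{H_k}}{d\mathbb{P}_{H_0}} \right],
\]
so on any event $A$ where the log-likelihood ratio of the observations is bounded below by $-c$,
\[
\mathbb{P}_{H_k}(A) \;\ge\; e^{-c}\, \mathbb{P}_{H_0}(A).
\]
If an $(\epsilon,\delta)$-correct algorithm must satisfy $\mathbb{P}_{H_k}(A) \le \delta$ for a suitable event $A$ with $\mathbb{P}_{H_0}(A) \ge 1/2$, this forces $c \ge \log \tfrac{1}{2\delta}$; since the log-likelihood ratio concentrates around $-\mathbb{E}_{H_0}[T_k]\,\mathrm{KL}(P^0_k, P^k_k)$, one obtains the familiar shape
\[
\mathbb{E}_{H_0}[T_k] \;\gtrsim\; \frac{\log \tfrac{1}{2\delta}}{\mathrm{KL}(P^0_k, P^k_k)}.
\]
The support mismatch raised above is precisely what prevents the first display from being written directly, since the Radon-Nikodym derivative need not exist.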
I would also like to mention some missing references from the best arm identification literature: the LUCB algorithm of [Kalyanakrishnan, Tewari, Auer, Stone, PAC Subset Selection in Stochastic Multi-armed Bandits, ICML 2012] is an important algorithm in the PAC setting, and for the lower bounds, there is the recent work of [Kaufmann, Cappé, Garivier, On the Complexity of Best Arm Identification in Multi-Armed Bandit Models, JMLR 2015].

To conclude, some minor typos:
- Theorem 3: do we need to assume epsilon < epsilon_0?
- l. 388: it took me some time to understand that you were extending the definition of \bar{\mu}_k here.
- l. 396: there is an extra t before F_k(\bar{\mu}_k).
- l. 429: let h be (and not "to be").
- l. 477: there is a 16 missing in the denominator.

=====