We thank the Reviewers for their insightful and helpful reviews!

R1:
- Coverage of all models for Thm 1: It is immediate to extend our lower bound to the case you mention, where there are arbitrarily many arms above or below the threshold. Indeed, if \S is the set of all problems where the distribution of arm i is either \N(\tau + d_i + \epsilon, 1) or \N(\tau - d_i - \epsilon, 1) (which allows for this case), then \max_{\B \in \S} \E_{\B}(\L(T)) is larger than \max_{0 \leq i \leq K} \E_{\B_i}(\L(T)) in Thm 1 (since \S contains the problems (\B_i)_i of Thm 1). So our result implies what you propose.
- Equalisation of T_k\Delta_k^2: In the proof of Thm 1 in L1120 (App), a crucial point is that T_i \leq T/(H\Delta_i^2) holds for some arm i (with H = \sum_k \Delta_k^{-2}). As this arm is under-pulled, so that T_i\Delta_i^2 \leq T/H, the result of L1127, and therefore the lower bound, holds. So an optimal strategy maximises \min_k T_k\Delta_k^2 and therefore equalises these quantities at approximately T/H: setting T_k\Delta_k^2 = c for all k under the budget constraint \sum_k T_k = T gives c \sum_k \Delta_k^{-2} = T, i.e. c = T/H.
- FB vs FC: In the fixed budget (FB) setting, state-of-the-art algorithms are either based on successive rejects (which cannot adapt to the true complexity) or require additional information to tune a parameter. This contrasts with the fixed confidence (FC) setting, where nearly optimal algorithms that adapt to the complexity of the problem without prior knowledge have already been exhibited. This is why we claim the FB setting is harder for the learner.
- Proof of the LB: The special case \tau = 0, \epsilon = 0 can be generalised by translating by \tau and by adding to the set of problems those problems where arms within \epsilon of \tau are set at exactly \epsilon from \tau. At the end of the proof, the argument is the following. If P_{\B_0}(A) \geq 1/2, then since P_{\B_0}(\xi) \geq 3/4, the intersection has probability at least 1/4 (as P_{\B_0}(A \cap \xi) \geq P_{\B_0}(A) + P_{\B_0}(\xi) - 1), and the bound follows. If P_{\B_0}(A) \leq 1/2, then \max_i \E_{\B_i}(\L(T)) \geq \E_{\B_0}(\L(T)) = P_{\B_0}(A^c) \geq 1/2.
- Thanks for pointing out important typos, e.g. TH instead of T/H (see Eq (10)) in L1257 (up to this typo, the proof is correct). We believe L1166 is right, but L1116 should read 2ab with b = 2\sqrt{\log(...)}.

R3:
- Algo GCL*: Thanks for pointing out the relevant reference AS2011, which we were not aware of. Up to the \epsilon modification, they introduced the same heuristic, but applied it toward a different aim, which is also very interesting. As you noted, their results intersect only with our Thm 4, as they consider only the cumulative regret setting. Most of our paper focuses on the TBP setting, which is closer to best arm identification, and where, surprisingly, this strategy is order optimal on a large class of problems. The APT/GCL* heuristic is very efficient and practical for the TBP (see the sketch below), while it is more of a theoretical tool for the cumulative regret problem considered in AS2011, since in their case the highest mean value must be known exactly. On top of studying the TBP, we proved the intriguing fact that this strategy minimises the simple and the cumulative regret simultaneously when \mu^* is known. We therefore believe that our results substantially add to AS2011, and we will cite and discuss this paper thoroughly in the revision.
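For concreteness, here is a minimal sketch of one natural instantiation of the APT/GCL* heuristic for the TBP: pull the arm minimising T_k(|\hat\mu_k - \tau| + \epsilon)^2, consistent with the T_k\Delta_k^2 equalisation discussed above. The Gaussian unit-variance rewards, parameter values, and function names are our illustrative assumptions, not a definitive description of the paper's algorithm.

```python
# Sketch (assumed instantiation): APT/GCL*-style index rule for the TBP.
import numpy as np

def apt(means, tau=0.5, eps=0.05, T=5000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(means)
    counts = np.zeros(K)                    # T_k: number of pulls of arm k
    sums = np.zeros(K)                      # running sum of rewards of arm k
    for k in range(K):                      # initialisation: pull each arm once
        sums[k] += rng.normal(means[k], 1.0)
        counts[k] += 1
    for _ in range(T - K):
        mu_hat = sums / counts
        # Pull the arm whose above/below-tau classification is least settled,
        # i.e. the one minimising T_k * (|mu_hat_k - tau| + eps)^2.
        index = counts * (np.abs(mu_hat - tau) + eps) ** 2
        k = int(np.argmin(index))
        sums[k] += rng.normal(means[k], 1.0)
        counts[k] += 1
    return (sums / counts) >= tau           # predicted arms above the threshold

# Example run: 5 Gaussian arms around tau = 0.5.
print(apt([0.1, 0.35, 0.48, 0.55, 0.9]))
```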
- Generality/consequences of refinements: In some specific parametric models \X (e.g. exponential families such as the Bernoulli model, or distributions with a finite number of atoms), one can easily refine the analysis following e.g. KCG2014 so that, instead of the squared gaps, the minimal KL divergence between the arm distribution and the possible arm distributions at the threshold appears, i.e. KL_k = \min_{D \in \X : \E_D X = \tau} KL(\nu_k, D). One can then consider the modified strategy that first samples all arms T/(2K) times and then samples the arm that minimises T_k \hat{KL}_k (see the Bernoulli sketch at the end of this response). The optimal order of the TBP loss is then, in most cases, \exp(-T/(\sum_k 1/KL_k)). It is clear, however, that such a KL-dependent upper bound does not hold in as much generality as our current upper bound (which holds for all sub-Gaussian distributions and, as explained in Section 2.4, for any distribution with a finite (1+u)-th moment). Moreover, outside very specific parametric examples, it is impossible to construct an algorithm that attains this KL-dependent bound. Hence, in a way, the kind of optimality you propose comes at the cost of less generality and a more complicated strategy; it is nevertheless a very interesting question.
- Optimality: We use this terminology in a less refined sense than you propose: our algorithm is order minimax optimal (up to a constant in the exponent) over large classes of problems with complexity H, i.e. classes of problems that contain Gaussian distributions with variance 1. We would like to highlight that such a result is not straightforward, as it is not available for the simple regret problem with fixed budget, where it is not clear whether the correct complexity is H or rather H_2 = \max_i i\Delta_i^{-2}.

R4:
- Thank you, we will correct the terminology issue about binary classification.
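Bernoulli sketch referenced above: in the Bernoulli model, the only distribution in \X with mean \tau is Bernoulli(\tau), so KL_k = kl(\mu_k, \tau), the KL divergence between Bernoulli means. The code below implements the modified strategy as described (T/(2K) forced pulls per arm, then greedy in T_k \hat{KL}_k); the parameter values and helper names are our illustrative assumptions.

```python
# Sketch (assumed parameters): KL-refined strategy in the Bernoulli model.
import numpy as np

def bern_kl(p, q, tiny=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, tiny, 1 - tiny), np.clip(q, tiny, 1 - tiny)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_strategy(means, tau=0.5, T=5000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(means)
    counts = np.full(K, T // (2 * K))       # forced exploration: T/(2K) pulls each
    sums = np.array([rng.binomial(n, p) for n, p in zip(counts, means)], float)
    for _ in range(int(T - counts.sum())):
        mu_hat = sums / counts
        # Pull the arm minimising T_k * kl(mu_hat_k, tau), i.e. the arm whose
        # classification relative to the threshold is least settled.
        k = int(np.argmin(counts * bern_kl(mu_hat, tau)))
        sums[k] += rng.binomial(1, means[k])
        counts[k] += 1
    return (sums / counts) >= tau           # arms classified as above tau

# Example run: 4 Bernoulli arms around tau = 0.5.
print(kl_strategy([0.2, 0.45, 0.55, 0.8]))
```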