First of all, we would like to thank the reviewers for their thorough reading and positive feedback. We appreciate all the comments; they will improve the quality of the revised version of the paper. We agree that this paper is dedicated to filling a small, yet crucial, gap in the multi-armed bandit literature. This is now a very popular area, as illustrated by the tremendous number of papers on variations of the base model. However, even the simple, original and fundamental problem still has crucial open questions, and closing them is at least as important as introducing small or fancy variations.

Reviewer 3 points to the paper "The Best of Both Worlds: Stochastic and Adversarial Bandits" for an interesting algorithm that we did not discuss. We will definitely compare with that paper in the final version. The algorithm presented there is not anytime, and its merit lies more in providing an adaptive solution to both the stochastic and the adversarial case. It is slightly suboptimal in both cases: in the stochastic setting, for example, its regret bound is $O(\log^2 T)$ while the optimal dependency is $O(\log T)$. In comparison, our goal is specifically an optimal algorithm in the stochastic setting. We will include a discussion of the differences between the two papers.

Reviewer 4 rightly notes that Thompson Sampling enjoys bounds that are close to ours while being anytime. UCB-like algorithms, however, have the additional advantage of a lower computational complexity: while $n$ steps of Thompson Sampling with $K$ arms require at least $O(Kn)$ computations, since all arm posteriors have to be sampled, UCB-based algorithms like MOSS-Anytime can be implemented so that this number is closer to $O(n)$ when $n$ is large. The idea is that the relative order of the indices of the arms that are not pulled does not change, so for large $n$ it usually suffices to compare the previous best arm to the second best arm to check that the previous best is still the best (a sketch of this lazy comparison is given at the end of this response).

Both additions will be included in the paper; we will make room by shortening the full-information analysis. In light of these two remarks on previous algorithms, we agree with Reviewer 1 that a table clearly showing each type of bound for all the algorithms discussed would be a useful addition.

Reviewer 3 also notes that the KL divergence is not defined for all reward distributions in Lemma 2. This is absolutely correct, and we will replace "all reward distributions", which is indeed wrong, with a brief but specific statement of the distributions to which the lemma applies. The goal of this lower-bound lemma is to give an idea of the performance one should expect in terms of $\Delta$ in the general case, and this correction does not affect our results.
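
To make the lazy index comparison mentioned in our reply to Reviewer 4 concrete, here is a minimal Python sketch. It is not the implementation used in the paper: the index function is a generic anytime UCB-style placeholder rather than the exact MOSS-Anytime index, and the names (lazy_ucb, draw, index) are ours for illustration only. The sketch re-sorts all $K$ indices only when the previous leader fails the comparison against the runner-up, relying on the order-preservation property invoked above.

import math
import random

# Generic anytime UCB-style index; a placeholder, not the exact MOSS-Anytime index.
def index(mean, pulls, t, K):
    bonus = math.sqrt(max(0.0, math.log(t / (K * pulls))) / pulls)
    return mean + bonus

def lazy_ucb(draw, K, horizon):
    # draw(arm) returns one reward sample; the sketch assumes K >= 2.
    means, pulls, history = [0.0] * K, [0] * K, []
    order = list(range(K))  # arms kept ordered by index, current best first
    for t in range(1, horizon + 1):
        if t <= K:
            best = t - 1  # pull every arm once to initialise the estimates
        else:
            if t == K + 1:
                # first full sort, right after the initialisation rounds
                order.sort(key=lambda a: -index(means[a], pulls[a], t, K))
            best, second = order[0], order[1]
            if index(means[best], pulls[best], t, K) < index(means[second], pulls[second], t, K):
                # lazy check failed: the previous leader lost the lead, re-sort all K arms
                order.sort(key=lambda a: -index(means[a], pulls[a], t, K))
                best = order[0]
        r = draw(best)
        pulls[best] += 1
        means[best] += (r - means[best]) / pulls[best]
        history.append(best)
    return history

# Example: three Bernoulli arms with means 0.2, 0.5 and 0.7, over 10000 rounds.
arms = [0.2, 0.5, 0.7]
plays = lazy_ucb(lambda a: float(random.random() < arms[a]), K=3, horizon=10000)

In the common case each round costs a constant number of index evaluations, with an occasional $O(K \log K)$ re-sort, which is the sense in which the total cost is closer to $O(n)$ than to $O(Kn)$.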