Paper ID: 746
Title: Anytime optimal algorithms in stochastic multi-armed bandits

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an anytime algorithm for the stochastic multi-armed bandit problem, which achieves optimal minimax regret and optimal distribution-dependent regret when the gaps are equal, but does not have the typical all-gaps-dependent regret bound of the UCB algorithm.

Clarity - Justification:
Since the contribution of this paper is clear and technical, I do not have much comment on the clarity aspect.

Significance - Justification:
The main contribution of this paper is showing that replacing the known horizon n with the current round t in the MOSS algorithm does not hurt the regret bounds of MOSS up to constants, while making the algorithm anytime. The idea of removing the (log K)^{1/2} term in the minimax bound (compared with UCB 2) is not new and the improvement is small, but it is still a state-of-the-art theoretical result.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
One major comment is on the organization of this paper: the full-information section does not seem very relevant. It would be better to have a centralized list (e.g., a table) of the current algorithms together with their regret bounds, whether they are anytime, and their bounds after applying the doubling trick. I am not very sure about the significance of this result and whether it is interesting enough to the community.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper provides an anytime algorithm (requiring no knowledge of the horizon T) with optimal distribution-independent (or problem-independent) bounds and an optimal single-parameter-dependent bound. The algorithm is an extension of the MOSS algorithm of Audibert & Bubeck (2010).

Clarity - Justification:
The paper is well written. The problem is well defined and the results are clearly stated.

Significance - Justification:
While it is useful to have an anytime algorithm, there are many existing anytime algorithms for MAB, including UCB1 and Thompson Sampling. In that light, the question is what improvement this paper provides over those existing algorithms.

Comparison to MOSS: The anytime algorithm is a very small extension of the MOSS algorithm of Audibert & Bubeck (2010), so the algorithmic ideas are not new. The regret bounds achieved by MOSS-anytime are the same as those of MOSS. The main contribution over MOSS is that MOSS-anytime does not need knowledge of the time horizon.

Comparison to other anytime algorithms: Another anytime algorithm is Thompson Sampling, which achieves a \sqrt{KT \log K} distribution-independent regret bound and optimal distribution-dependent regret bounds (refer to Agrawal & Goyal, AISTATS 2013). The Thompson Sampling algorithm is arguably as simple as the MOSS-anytime algorithm proposed by the authors. In comparison, MOSS-anytime achieves a \sqrt{KT} distribution-independent bound (an improvement of \sqrt{\log K}), but suffers on the distribution-dependent bounds (it achieves only an optimal *single*-parameter-dependent regret bound).

Therefore, in my opinion, the contributions of the paper are a small increment over existing algorithms.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper seems technically correct; my rating is based mainly on my concerns regarding the significance of the results, which I have explained above.
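For concreteness, the algorithmic change under discussion is small enough to sketch in a few lines. The Python snippet below contrasts the MOSS index, which uses the known horizon n, with the anytime variant, which simply substitutes the current round t. This is my own minimal sketch for the benefit of the discussion, not the authors' code; the exact exploration constants and tie-breaking rules in the paper may differ.

import math
import random

def moss_index(mean, pulls, horizon, K):
    """MOSS index with a known horizon n (Audibert & Bubeck, 2010):
    mean + sqrt(max(log(n / (K * pulls)), 0) / pulls)."""
    bonus = math.sqrt(max(math.log(horizon / (K * pulls)), 0.0) / pulls)
    return mean + bonus

def moss_anytime_index(mean, pulls, t, K):
    """Anytime variant: the known horizon n is replaced by the current
    round t, so no horizon knowledge is needed."""
    bonus = math.sqrt(max(math.log(t / (K * pulls)), 0.0) / pulls)
    return mean + bonus

def run_moss_anytime(arms, T):
    """Pull each arm once, then always pull the arm with the largest
    anytime index. `arms` is a list of reward samplers (hypothetical
    interface for this sketch)."""
    K = len(arms)
    pulls = [0] * K
    sums = [0.0] * K
    for t in range(1, T + 1):
        if t <= K:                      # initialization: one pull per arm
            i = t - 1
        else:
            i = max(range(K), key=lambda j: moss_anytime_index(
                sums[j] / pulls[j], pulls[j], t, K))
        r = arms[i]()                   # observe a stochastic reward in [0, 1]
        pulls[i] += 1
        sums[i] += r
    return pulls

if __name__ == "__main__":
    bernoulli = lambda p: (lambda: float(random.random() < p))
    print(run_moss_anytime([bernoulli(0.4), bernoulli(0.6)], T=10_000))

As the snippet makes plain, the two indices differ only in one argument, which is why I view the algorithmic novelty as limited even though the analysis is non-trivial.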
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an extension of the MOSS algorithm for stochastic MAB problems that enjoys anytime distribution-dependent and distribution-free regret guarantees. The latter are minimax optimal, and the distribution-dependent bound has the right dependence on the parameters K and \Delta_min.

Clarity - Justification:
Some arguments and results remain unclear to me.
- For the derivation of Theorem 2, the choice of k_0 above the statement of the theorem is unclear.
- The assumptions made on the reward distributions are unclear. For example, the KL divergence requires that the distributions be absolutely continuous w.r.t. some common measure on [0,1], right? See Lemma 2.
- Lemma 2: I am not sure it holds for all classes of distributions. See, e.g., the paper by Honda et al.

Significance - Justification:
To my knowledge, the proposed algorithm constitutes the first algorithm for MAB problems with such performance guarantees, and this seems like an interesting contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is essentially well written and presented. It provides the first algorithm with such anytime regret guarantees. Theoretically, the result is interesting and worth publication at ICML. There is, however, one important related paper that is neither cited nor discussed:
[1] "The Best of Both Worlds: Stochastic and Adversarial Bandits" by Bubeck and Slivkins, COLT 2012.
That paper proposes regret guarantees that resemble those considered in the present paper. It seems that the present paper improves over the analysis in [1], but the authors should discuss the relationship between the two papers.
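Addendum on the second clarity point: to make the absolute-continuity concern concrete, the standard definition I have in mind is the following (a textbook fact, written in my own notation \nu_1, \nu_2 for two reward distributions on [0,1], not notation taken from the paper):

\mathrm{KL}(\nu_1, \nu_2) = \int_{[0,1]} \log\Big(\frac{d\nu_1}{d\nu_2}\Big) \, d\nu_1 \quad \text{if } \nu_1 \ll \nu_2, \qquad \text{and } +\infty \text{ otherwise.}

So Lemma 2 implicitly requires \nu_1 \ll \nu_2 (or that both distributions be dominated by a common measure with suitable densities) for the divergence to be finite, and the authors should state the assumed class of reward distributions explicitly.

=====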