Paper ID: 65
Title: Multi-Player Bandits -- a Musical Chairs Approach
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper considers the problem of online learning in stochastic multi-player bandit problems where the goal of the players is to maximize their total rewards while avoiding "collisions" that occur when multiple players pick the same arm in the same round. The authors provide two main results: (1) an algorithm that works for a fixed number of players and guarantees "constant" regret with high probability and (2) another algorithm for a changing number of players that guarantees sublinear regret. Besides the theoretical results, the authors also run several illustrative simulations comparing their algorithm to the previous state-of-the-art method.
Clarity - Justification: The paper is very enjoyable to read and the technical quality appears to be very good, although I didn't have the capacity to check all the proofs in the 13-page supplementary material. I think that the paper could possibly benefit from providing high-level proof sketches for the main results, probably at the expense of the lengthy discussion of the MEGA algorithm in Section 4. (One minor comment: The proofs in the appendix offer an intimidating level of detail, meticulously spelling out every step of elementary algebra. This certainly helps the reproducibility of the results, but makes it very difficult to efficiently check whether the results are morally correct or not. So, please, consider adding a sketch for each proof.)
Significance - Justification: In a sense, the results do improve the state of the art in the problem considered, although I would add a few more remarks about the limitations. One crucial detail that the authors never seem to mention in the paper is that their algorithm crucially requires the time horizon T to be fixed, unlike the results of Avner and Mannor (2014), who proved bounds that hold for asymptotically growing T. Indeed, "constant" regret bounds that hold at a fixed confidence level make little sense in the asymptotic regime, where the confidence has to increase as the stakes go up (see, e.g., the discussion in Abbasi-Yadkori, Pal and Szepesvari, NIPS 2011). In fact, it is well known that in order to achieve asymptotically optimal regret, learning algorithms have to sample suboptimal arms infinitely many times (cf. the classic results of Lai and Robbins). Thus, since the results are only naturally applicable to the finite-horizon setting, it would be necessary to state the worst-case regret bound of the algorithm for finite T, which seems to be O(T^{2/3}) due to the bound's scaling with 1/Delta_min^2. That is, I think it is unfair of the authors to claim a huge improvement over Avner and Mannor (2014) as they currently do.
The extension to a time-varying number of players is an important new result, although it also seems to suffer from some of the same problems as the basic algorithm: the bound is stated for a fixed (and in this case also *known*) time horizon T, and it also depends on the minimal reward gap in a rather complicated way. As above, the authors should explicitly state the worst-case regret bound to enable a better comparison with other results. (My rough calculations show a bound of O(T^{5/6}) at best.) All these complaints aside, the proposed algorithm is indeed very elegant and much simpler than all previous methods.
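To spell out the rough calculation behind the O(T^{2/3}) claim (a sketch under the assumption that an exploration phase of length T_0 incurs at most O(T_0) regret, identifies the top arms whenever T_0 is of order 1/Delta_min^2, and that a misidentification among near-optimal arms costs at most Delta_min per round):

```latex
\[
  \mathrm{Regret}(T) \;\lesssim\; T_0
  \;+\; T\,\Delta_{\min}\,\mathbf{1}\!\left\{\Delta_{\min} \lesssim T_0^{-1/2}\right\}
  \;\le\; T_0 + T\,T_0^{-1/2},
\]
\[
  \min_{T_0}\left(T_0 + T\,T_0^{-1/2}\right) \;\asymp\; T^{2/3}
  \quad\text{attained at } T_0 \asymp T^{2/3}.
\]
```

So the worst case over the gap Delta_min yields O(T^{2/3}), matching the rate of Avner and Mannor (2014).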
I appreciate the detailed discussion of the algorithm of Avner and Mannor (2014) and the experiments also tell an interesting story. Once the authors add a more in-depth discussion about the above issues, the paper should be suitable for publication.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
068: extra "below".
197: Shouldn't \mu_k(t) be \mu_k instead? Also, maybe \mu_{j(t)} would be better notation than \mu_j(t)?
249: It's not obvious to me that N^* is always bounded by K. What happens if it turns out to be larger than K (however unlikely that seems)?
355: extra "epochs".
401: Although this result holds with "arbitrarily high probability", please mention that this probability has to be part of the input to the algorithm.
507: As I mentioned above, MEGA seems to do the right thing by never stopping exploration, and fixing each player's choice may actually be harmful when T grows large.
577: Does MEGA have a performance guarantee of any nature for this choice of \alpha? Can you prove a sublinear bound for DMC when the number of epochs is not so luckily chosen?
674: "dotted" -> "dashed". Also, I don't quite see the decreasing trend of DMC's regret.
737: "zero" -> "one".
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper focuses on a multi-armed bandit problem with not one but many decision makers; if several of them take the same decision, they collide and get no reward. An algorithm that the decision makers should follow is provided, with explicit regret bounds (namely, a bounded regret).
Clarity - Justification: I found this paper well written and its claims justified.
Significance - Justification: The results somewhat improve the previous results of Avner and Mannor and are thus rather incremental. Moreover, even if the setting is clearly a simplified version of a possibly relevant problem, the proposed algorithm seems hardly applicable in practice, and this is a classical argument against learning algorithms in multi-player games. Asking the players to play randomly until they find an equilibrium (i.e., an allocation without collisions) seems inefficient and not really relevant (and, as usual, they have to share a common clock in a dynamic framework).
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I raised some general criticisms against learning in the multi-player scenario, but this is a new, difficult framework. With respect to the aforementioned issues, the paper is still interesting but rather incremental.
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper studies a new variant of the multi-armed bandit problem. Unlike the traditional model, the new model has more than one non-communicating player. In each round, a player chooses an arm she wants to explore, but receives the reward only if no other player chooses the same arm (otherwise, she receives an indication of a collision). The goal is to upper bound the expected regret against the solution in which, in each round, each player is assigned to one of the top N arms (w.r.t. their expected means) if there are N players left in that round. Note that the goal here is to maximize the total reward of all the players. Two cases are then studied. In the static case, the set of players does not change, and the authors provide an algorithm which w.h.p. achieves constant regret (it does not depend on T). This significantly improves the previously known algorithm with regret bound O(T^{2/3}). In the dynamic case, some players can enter or leave during the game. For this case, they provide an algorithm that w.h.p. achieves a regret of O(\sqrt{xT}), where x denotes the total number of players that have ever entered the game. In that case, no algorithm with a non-trivial regret bound was known before. They also discuss at length some problematic scenarios in which the previous algorithm, called MEGA, would fail, and demonstrate the effectiveness of their algorithm in those scenarios. In addition, they provide further experimental justification of their algorithm's performance. To achieve their regret bound in the static case, they basically divide the algorithm into two phases. In the first phase, the players just randomly explore all the arms.
Assuming a gap between the N-th and (N+1)-th largest arms, after a prescribed number of rounds every player w.h.p. identifies the set of the N largest arms and infers the exact value of N; they then condition on that event. In the second phase, using a so-called Musical Chairs procedure, w.h.p. each player settles on one of the N largest arms without any further collisions after a certain number of rounds. Afterwards there is no further regret, making the total regret bounded by a constant. In the dynamic case, they need to use a synchronized clock. They divide the T rounds into a number of blocks, and in each block simply restart the algorithm for the static case. The final regret bound follows by choosing the optimal number of blocks.
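The fixing phase described above can be sketched in a few lines (a minimal simulation of the Musical Chairs idea, not the authors' implementation; the player indexing, the settled-arm dictionary, and the collision-counting loop are illustrative assumptions):

```python
import random

def musical_chairs(n_players, top_arms, n_rounds, seed=0):
    """Each unfixed player picks a uniformly random arm from the top-N set;
    a player that experiences no collision fixes that arm forever."""
    rng = random.Random(seed)
    fixed = {}  # player index -> arm the player has settled on
    for _ in range(n_rounds):
        # Settled players replay their arm; the rest pick uniformly at random.
        picks = {p: fixed[p] if p in fixed else rng.choice(top_arms)
                 for p in range(n_players)}
        # Count how many players chose each arm to detect collisions.
        counts = {}
        for arm in picks.values():
            counts[arm] = counts.get(arm, 0) + 1
        # A player that hears no collision fixes its arm for all future rounds.
        for p, arm in picks.items():
            if p not in fixed and counts[arm] == 1:
                fixed[p] = arm
        if len(fixed) == n_players:
            break
    return fixed

# With N players and the N top arms, all players settle quickly w.h.p.
assignment = musical_chairs(n_players=3, top_arms=[0, 1, 2], n_rounds=1000)
```

Since each unfixed player has a constant probability per round of landing alone on a free arm, the number of rounds until everyone settles is geometric-tailed, which is what makes the post-exploration regret bounded by a constant w.h.p.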
Clarity - Justification: The paper is fairly easy to follow.
Significance - Justification: Overall I think it is an interesting work. The model they studied is natural and may have practical significance in cognitive radio. The improvement over the previous bound is also significant.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Here are some minor comments:
Line 103: "x is a bound on the total number of players exiting and leaving." -> entering and leaving.
Line 176: "Generally, we assume K, N ,and N_t are all much smaller than T": remove the space after N.
In Algorithm 1 (MC): you use C_t in the analysis but C_{T_0} in the pseudo-code, which is inconsistent.
Theorem 1: where does \delta_2 in the expression come from? It only appears in the appendix.
Lemma 1: It seems you use $\epsilon-$correct rather than $\epsilon$-correct (clearly the latter looks better).
Another comment on Lemma 1: Actually, in this setting all arms are supported in [0,1]. Hence \epsilon \le 1, which means the second term in the max expression always dominates the first term; maybe you can just remove the first term?
Line 354: "to synchronize the epochs epochs": remove one "epochs".
Line 415: "The bound in the lemma hides": lemma -> theorem.
Lemma 4: should be i \le N_m, as in the definition in Theorem 2.
Line 1144: better to use \exp rather than e^?
Line 1233: "Using the events, A and B as defined": remove the comma after "events".
Line 1244: "if one knows know N": remove one "know".
Line 1248: "x \ge [0,1]": \ge -> \in.
=====