Response$========
We thank all reviewers for their helpful and detailed comments.
Reviewer 1
----------
We certainly agree that a practical implementation of the algorithm will involve some issues, but as mentioned by the reviewer, this is a relatively new, difficult setting, and we do make fewer assumptions than previous work. For example, the shared clock assumption is weaker than previous "shared knowledge" assumptions such as an auction mechanism or a centralized control. Also, we note that the fixing stage (where players choose an arm randomly until no collision occurs) is over quite quickly, so it does not lead to a significant increase in regret. In any case, we will make sure these issues are fairly conveyed in the paper.
Reviewer 3
----------
We thank the reviewer for his comments. They all indeed refer to typos which we shall fix.
Reviewer 4
----------
We thank the reviewer for his very helpful comments, and as suggested, we will add a more in-depth discussion of the issues raised. We will also correct the typos mentioned in the review. Below are some specific responses.
Time horizon and Delta_min: First of all, we would like to clarify that our focus is regret bounds which hold in expectation, *conditioned* on some high-probability event, where the probability parameter delta is a *user-defined* parameter. We note that the results of Avner and Mannor (2014) actually provide the same kinds of guarantees, so our results are comparable to theirs (e.g. in their theorem 1, the expectation is actually conditional on all users knowing the correct ranking of the best arms -- see top of pg. 7 in the arxiv version). This is different than bounds on the unconditional expected regret, which indeed (as the reviewer points out) would require tuning delta based on the horizon T.
Extension for time-varying number of players: Here, T is indeed a parameter that the algorithm receives, and we indeed mention this as one of the assumptions that will be important to lift in future work (possibly using doubling tricks). Regarding the minimal gap, as mentioned above, we consider it as a given quantity rather than something to be tuned based on an expected regret bound. In any case, knowing a lower bound on the gap is also sufficient for our results to hold. In any case, we agree that there is room for a more in-depth discussion about these issues, and we will add it.
High-level proof sketches: Thanks for the suggestion, we will indeed add these.
Line 197: Thanks for catching this, we will fix accordingly.
Line 249: Indeed we did not refer to that event. This is very unlikely, but possible, as you mentioned. In the event where N^* > K we use N^* = K (as in the event of C_T0 = T0). We will add this note to the algorithm.
Line 401: The probability of success is determined by T0, the number of pure exploration rounds at each epoch, which is given to the algorithm as a parameter. It would have also been possible to consider delta as the parameter, and derive T0 based on it, but since the algorithm actually uses T0 rather than delta, it seemed more natural to us to consider T0 as the parameter.
Line 577: The MEGA algorithm does not give any regret guarantees for the dynamic setting (except for a scenario where a single player leaves after the system has stabilized, and they show that the added regret due to the player leaving is order of t^{beta}). We note that the requirement for alpha to satisfy the bound in Theorem 3 is a technical requirement that comes from the analysis, and not because MEGA has some provable bounds for higher values. Regarding a bound for the DMC algorithm for a different/suboptimal choice of number of epochs, we didn't write down a formal analysis, but in principle this can certainly be done by plugging different values into the regret bound in lemma 4.
Line 674: We will fix the typo. The decreasing trend of the regret can be seen more clearly perhaps in Appendix B figure 6, which considers a similar scenario with a longer time horizon.