We thank all the reviewers for their thoughtful comments.

Rev. 1: Firstly, we will clarify the statement of Theorem 2 by emphasizing that L depends on δ and n, and that the regret bound and budget constraint hold for all time steps n.

The problem and algorithm we describe are only superficially simple, and, we believe, only in hindsight. We discovered that many equally "obvious" algorithms fail for non-obvious reasons, but chose to devote the available space to the different settings and to the variants that do work. For example, we considered your suggestion (the highest UCB amongst the "safe" arms becomes the conservative arm): it indeed maintains the budget constraint, but it may fail because optimism alone does not grow the budget fast enough to allow exploration of the other arms. In contrast, the variant outlined in Remark 4 does work, because its conservative choice grows the budget by at least αμ₀ with high probability (a schematic sketch of this budget check is given at the end of our reply). However, as we argue, it can only improve the minimax bound by a constant factor. Since submission, we have performed additional experiments showing the improvement to be very modest, and we will include the results in any final version.

As we point out in "Previous Work" (lines 216-223), the algorithms of Lattimore (2015) have severe deficiencies compared to ours: 1) they do not maintain the constraint at every time step; 2) the constraint only holds in expectation, not with high probability; and 3) they explicitly use the desired regret bounds and thus rely heavily on knowing the payoff of the default action.

We have two reasons to constrain the mean payoff rather than the actual payoffs. The first is that constraining the means allows us to measure the skill of the algorithm while ignoring the noise; this is essentially the same justification as for using the pseudo-regret in the stochastic bandit setting. The second (related) reason is that the manager has already accepted the (unavoidable) noise in the returns, which has been observed in the past while using the default action. That said, a straightforward modification to our algorithm's definition of the budget allows it to satisfy the constraint in terms of the actual payoffs, just as in the adversarial case; the cost is an extra additive term of Δ₀L/(αμ₀)² in the regret. We will add this result to any published version.

We find the adversarial case intriguing precisely because of the gap in regret compared to the stochastic setting; understanding and closing this gap (through improvements in either the analysis or the algorithms) would be interesting future work. We feel that including the adversarial case along with all the stochastic variants gives a more comprehensive view of the problem and the solution space.

Finally, our proofs about the extra cost of maintaining the budget have a novel character that may be unique in the bandit literature: they analyze algorithms that 1) operate in two very different "phases", which 2) strongly interact with each other. Indeed, our results illustrate that the main challenge in this problem lies in the careful balance between the budget-building and exploration phases.
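To make the budget check concrete, the following is a minimal, purely illustrative Python sketch; it is not the paper's exact algorithm or confidence bounds. It assumes, for simplicity, that the default arm's mean payoff μ₀ is known, uses a generic Hoeffding-style confidence radius in place of the L-dependent one, and the names (choose_arm, counts, means, alpha, delta) are placeholders of ours.

import numpy as np

def choose_arm(t, counts, means, mu0, alpha, delta):
    """Illustrative conservative decision rule; not the paper's exact algorithm.

    Arm 0 is the default with known mean mu0. The constraint to respect is that
    the cumulative mean payoff stays above (1 - alpha) * t * mu0.
    """
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    # Generic Hoeffding-style radius; the paper's quantity L (depending on
    # delta and n) would enter through a tighter, anytime-valid choice here.
    rad = np.sqrt(np.log(1.0 / delta) / np.maximum(counts, 1.0))
    ucb = np.where(counts > 0, means + rad, np.inf)
    lcb = np.where(counts > 0, np.maximum(means - rad, 0.0), 0.0)
    ucb[0], lcb[0] = mu0, mu0          # the default arm's mean is known

    candidate = int(np.argmax(ucb))    # optimistic (UCB) choice
    # Pessimistic estimate of the cumulative mean payoff if `candidate` is played now:
    pessimistic_total = float(np.dot(counts, lcb)) + lcb[candidate]
    required = (1.0 - alpha) * (t + 1) * mu0
    if pessimistic_total >= required:
        return candidate               # enough budget: explore optimistically
    return 0                           # otherwise play the default arm, which grows
                                       # the slack by roughly alpha * mu0 per step

The point of the discussion above is that replacing the fallback to arm 0 by the best "safe" arm still maintains the constraint, but need not increase the slack pessimistic_total - required, which is why the suggested variant can fail to free up enough budget for further exploration.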
Rev. 2: Thank you for your kind words. We were indeed motivated by practical situations, and we believe that our work is only the first step in this direction.

Rev. 3: As mentioned above, we were motivated to work on this problem by practical considerations. To our surprise, most of the "obvious" algorithms we came up with (e.g. BudgetFirst and the reviewer's suggestion above) did not quite work, sometimes for very subtle reasons. The results we present, even though they appear superficially obvious, incorporate the solutions to many difficulties that only became apparent when we attempted to formalize our intuitions. The main challenge is to carefully balance the budget-building with the exploration: the budget must be accumulated safely, but also quickly enough to support sufficient exploration. This balance is reflected in the proofs, which are rather novel in considering the two very different "phases" of the algorithm, which, moreover, cannot be decoupled from each other. We will certainly highlight these challenges further in any future revision of the paper.

The "magical" loglog factor in the regret arises from a peeling device; we will expand upon the intuition behind it. Indeed, we will give a more detailed proof by restating a theorem of Garivier (2013) and giving a reduction to it. We will also experimentally demonstrate the improvement over the simple choice in line 496. Note that, as described in §3.4, to bound the expected regret we set δ to 1/n, which gives the usual log(n) bound for bandit problems (the generic conversion is recalled at the end of this response).

The intuition for the lower bound was regrettably left out of the paper for lack of space. We agree that it is illuminating and will include it. The challenging case is when the optimal arm is only a little better than the default, and all other arms are a little worse. The proof hinges on carefully balancing all the involved values so that the required quantities appear.
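For completeness, the δ = 1/n remark above refers to the generic high-probability-to-expectation conversion; we state it here with rewards in [0, 1] and a generic high-probability regret bound B(n, δ) that grows logarithmically in 1/δ (this is the standard argument, not a restatement of our theorem):

  E[Regret(n)] ≤ B(n, δ) + δ·n ≤ B(n, 1/n) + 1 = O(log n),   taking δ = 1/n,

since on the failure event, which has probability at most δ, the regret is at most n.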