Paper ID: 582
Title: Conservative Bandits

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper focuses on minimizing regret in a multi-armed bandit scenario under the constraint that the cumulative expected reward must, at all stages, be greater than (1-epsilon) times the cumulative reward of a given fixed action.

Clarity - Justification:
This paper is rather clear and easy to follow. The main difficulty lies in the reading of the main statement (for instance, in Theorem 2 it feels like the regret is uniformly bounded in T).

Significance - Justification:
This is yet another paper on bandits with a small additional constraint (and closely related to a paper of Lattimore). The algorithm (and proofs) are very classical: with probability 1-\delta, the expected rewards lie at all stages within their confidence intervals; then apply a deterministic algorithm (namely, choose the arm with the highest upper bound, on the condition that the constraint will not be violated; otherwise use the default arm). By the way, I do not understand why the conservative algorithm is not: choose the arm with the highest upper bound amongst those whose lower bound will not violate the constraint. Is it equivalent to the one introduced in Remark 4 (whose first point is not really convincing)? The adversarial section is also not really intriguing. Moreover, I am not convinced at all by the motivating story that justifies the model. If a manager wants to use a bandit algorithm under the condition of having a certain amount of cash, then the constraint should not be written in expectation but with respect to the realizations (as written in Equation (3)). I am not sure that, in that case, it is equivalent to the condition in expectation.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I already made detailed comments in the previous sections. The paper is rather incremental, and the algorithms introduced and the proofs are also rather common.
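For concreteness, here is a minimal sketch of the two selection rules being contrasted above, assuming the constraint is checked against a pessimistic (lower-confidence) estimate of the reward accumulated so far; all names and details are illustrative and not taken from the paper.

```python
# Illustrative sketch only (hypothetical names; the paper's algorithm may differ).

def select_arm_as_described(ucb, lcb, budget_lb, mu0_lb, alpha, t, default_arm=0):
    """Play the UCB-optimal arm only if, pessimistically, the conservative
    constraint still holds after this round; otherwise play the default arm.

    ucb, lcb  : per-arm upper/lower confidence bounds on the mean rewards
    budget_lb : pessimistic estimate of the reward accumulated so far
    mu0_lb    : lower bound on the default arm's mean reward
    alpha     : fraction of the default arm's cumulative reward we may give up
    t         : current round (1-indexed)
    """
    best = max(range(len(ucb)), key=lambda i: ucb[i])
    if budget_lb + lcb[best] >= (1 - alpha) * t * mu0_lb:
        return best
    return default_arm


def select_arm_reviewer_variant(ucb, lcb, budget_lb, mu0_lb, alpha, t, default_arm=0):
    """Alternative raised in the review: among the arms whose lower bound keeps
    the constraint satisfied, play the one with the highest upper bound."""
    feasible = [i for i in range(len(ucb))
                if budget_lb + lcb[i] >= (1 - alpha) * t * mu0_lb]
    if not feasible:
        return default_arm
    return max(feasible, key=lambda i: ucb[i])
```

In this sketch the two rules differ only when the UCB-maximizing arm fails the pessimistic budget check while some other arm passes it; whether that makes them equivalent to the rule of Remark 4 is exactly the reviewer's question.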
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies K-armed bandits (both stochastic and adversarial) under the constraint that at any point in time the total reward accumulated so far has to be at least a fraction of the reward accumulated by a default arm. The paper presents a UCB-style algorithm for stochastic bandits and a reduction from a generic adversarial bandit algorithm to an algorithm that respects the budget constraint. These come with problem-dependent as well as worst-case guarantees that differ from those of classical K-armed bandits by an additive term. The authors also show an essentially matching lower bound for the worst case.

Clarity - Justification:
The paper is very clear.

Significance - Justification:
This is definitely a nice contribution--the problem is of practical significance and the result extends previous theory.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I enjoyed reading this paper. The algorithms are fairly natural and the theoretical implications are well explained. Given that budgeted bandits are of practical importance, I expect this will be of interest to the ICML community.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper is on multi-armed bandits (MAB), with a goal to minimize the (cumulative) regret. There is a novel constraint: the algorithm must also do well compared to a default option (modeled as arm 0). Namely, in each round, the cumulative reward of the algorithm up to this round must be within a (1-alpha) factor of that of arm 0, for some parameter alpha. The (expected) reward of arm 0 does not need to be known a priori. Both IID and adversarial versions are considered. For both versions, the paper defines natural versions of the standard algorithms and analyzes their regret. A lower bound is also provided. The regret bound for the IID case is essentially optimal in the worst case. For the adversarial case, the regret bound is optimal for constant alpha; there is a gap in terms of the dependence on alpha (inverse linear vs. inverse quadratic). Also, there are some simulations against some other algorithms. The solution for the adversarial case builds on an algorithm for adversarial bandits with a sublinear, high-probability regret bound. In fact, it is a general "template" that works with any such algorithm. Relation to prior work: a similar problem with full feedback has been studied in several papers; for bandit feedback, prior work only considered the constraint at a given time horizon, whereas here the constraint holds for all rounds.
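For reference, a minimal formalization of the per-round constraint described in this summary (the notation here is assumed, not quoted from the paper):

```latex
% Per-round conservative constraint as described above (notation illustrative):
% the learner's cumulative reward must stay within a (1 - \alpha) factor of
% what always playing the default arm 0 would have earned.
\[
  \sum_{s=1}^{t} r_{s} \;\ge\; (1-\alpha) \sum_{s=1}^{t} r_{s,0}
  \qquad \text{for every round } t \ge 1 ,
\]
% where r_s is the reward collected in round s and r_{s,0} is the reward the
% default arm would have given.  In the stochastic setting this is typically
% imposed in expectation, i.e. with t\,\mu_0 on the right-hand side
% (cf. Review #1's remark about expectation versus realizations).
```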
Clarity - Justification:
The paper is clear, well-written, and looks mature (e.g., it carefully treats various variants of the problem and the solutions).

Significance - Justification:
Motivation: in many applications, one needs to explore without causing much harm. The extra requirement in this paper is that the "no-harm" condition holds for all rounds (rather than at a specific checkpoint). The provided motivation is that a company that is doing exploration may need continuous cash flow to survive. An additional motivation that I can think of is that providing bad service to customers during some initial segment of time may ruin the company, too. So I think the problem is reasonably well-motivated.
Pros: the problem is well-motivated, the algorithms are reasonable and elegant, the regret bounds make sense and are nearly optimal, and the simulations look very nice for a mostly theoretical paper. Also, the paper is clear, well-written, and looks mature.
Cons: the problem, while certainly not trivial, just does not look very challenging. The high-level approaches are among the few natural ideas that an expert in the field would come up with, and filling in the implementation and deriving the analysis would not be that complicated, either. The lower bound proof looks trickier, as lower bounds often are. However, a slightly weaker lower bound has already been proved in prior work (Lattimore 2015a), which reduces the significance. Namely, in prior work the expected reward of arm 0 is not known to the algorithm, whereas in this paper it is known.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Some other comments:
- It would help to highlight what is conceptually or technically challenging in this paper.
- The improved confidence term in Eq. (9) has only log log() dependence on the number of trials, which seems a little magical. It would be useful to provide some intuition for why this is possible. A proof in the full version would be good too, to make the paper self-contained. You could also comment on how much this improvement saves in terms of regret, compared to the "simple choice" in L496.
- An intuition for the lower bound would be useful. In particular, the problem instance ensemble used in the lower bound, and the intuition for why it works, can often be described quite succinctly (even when the actual proof is quite tedious).

=====