Paper ID: 917
Title: BISTRO: An Efficient Relaxation-Based Method for Contextual Bandits

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper considers the problem of learning in contextual bandit problems where the losses may be generated by an adversary, but the contexts are generated in an iid fashion. The main contribution is an algorithm that is computationally efficient in the sense that it only requires polynomially many calls to an empirical risk-minimization oracle. The authors show that the regret of the resulting algorithm grows as O(T^{3/4}) as a function of the time horizon T. The main technical tool used by the authors is a new extension of the successful relaxation-based online learning framework to partial-information problems.

Clarity - Justification:

The paper is written well for the most part, although the lengths of the sections are not always proportional to their respective importance. In particular, I find Sections 6.1-6.4 to be a bit drawn-out; I think some text here should be sacrificed to give room for a high-level sketch of the proofs of Lemma 1 and Theorem 2 in the main text. Section 6.5, however, is very interesting in its own right and deserves a better placement within the paper. Otherwise, the paper is enjoyable to read and the technical quality is excellent.

Significance - Justification:

The contextual-bandit problem has received a lot of attention in recent years, and the current paper contributes an important step towards finding an efficient solution. The main technical tool developed by the authors seems very useful in itself and has great potential to inspire future work in online learning with partial feedback. The contextual-bandit algorithm itself and its analysis are also very clean and elegant. The extensions for data-dependent policy classes are also interesting, although much less so than the main result.

The obvious downside of the new results is the scaling of the regret bound with the time horizon T. I don't see this as a very serious problem, since the result is still the best known bound for computationally efficient algorithms for contextual bandit learning. The authors also do a good job of explaining where the apparent looseness of the bound may stem from, although it would probably be better if these bits of discussion all appeared in the same section (at the moment, Sections 5 and 7 both offer possible explanations). In particular, the authors argue that the sub-optimality may be partly attributed to the uniform exploration step used by the algorithm. I wonder if the "implicit exploration" technique of Kocak et al. (2014) (and Neu, 2015) could be used instead of uniform exploration to achieve better results.

As for the setting itself, I have one important concern. Namely, the authors allow the losses to be chosen arbitrarily by the environment, but it is never made quite clear whether they can actually be chosen as an arbitrary function of the interaction history and the contexts, or whether they should be regarded as fixed in advance. Notice that in the contextual setting with iid contexts, the standard arguments relating the above two cases do not apply: if all losses are fixed in advance, then they are obviously statistically independent of the contexts, so the learner has no hope of improving its performance by taking the contexts into account. On the other hand, this is not the case if the losses in round t are allowed to depend on the context in round t.
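To make the two cases concrete (the notation below is mine and only illustrative; I am not claiming this is exactly the adversary model intended by the paper):

    Oblivious losses:  c_1, ..., c_n are fixed before the contexts x_1, ..., x_n are drawn iid,
                       so c_t is independent of x_t for every t and the contexts carry no
                       information about the losses.
    Adaptive losses:   c_t = c_t(x_1, ..., x_t, \hat{y}_1, ..., \hat{y}_{t-1}), i.e., the loss in
                       round t may depend on the current context and on the learner's past
                       (randomized) actions, so conditioning on x_t can genuinely help.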
I do believe that the analysis in the paper actually applies to this latter case, but maybe it is worth stating this strength more explicitly.

Overall, I think that the paper is very strong and definitely deserves to be published after some slight restructuring and a better discussion of the above issues.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

181: This open question is rather vague: is it about the proposed algorithm or about the best achievable regret in general? If it is the latter, then a partial answer is given by Beygelzimer et al. (2011), who propose an inefficient algorithm that has O(\sqrt{T}) regret against the best policy in a VC class.
242: The notation M_f for denoting matrices clashes with the notation M_t that is used to refer to columns of a matrix M.
258: "within \delta from the minimum"---maybe a *multiplicative factor* of \delta?
307: Please mention that the proof is actually in the paper.
341: The actual regret bound in terms of the time horizon is never stated as a theorem. I think that several potential readers might miss the point that the Rademacher average appearing under the square root is expected to grow as \sqrt{n} for several interesting function classes.
477: At this point, the meaning of "constraints" is quite fuzzy. Are they supposed to be some sort of inequality constraints on some real-valued functions?
555: "representation the corresponding" -> "representation of the corresponding"
813: "loseness"
834: "do denote" -> "to denote"
1001: "we now reason conditionally on x_t"---only on x_t or on the entire past?
1006: The scopes of the suprema are not indicated in the entire proof, which makes the formulas a bit difficult to parse.
1009: Isn't the second equality supposed to be an inequality? This decoupling of q'_t may be loose here too, and not only in Eq. (22).
1113: Here, the scopes of the expectations and the suprema are missing, which again makes the statements difficult to follow.
1124: Please indicate the scopes of the expectations and reorder the terms---right now, this passage is very confusing, although still apparently correct once the reader figures out where the parentheses are supposed to be.

References:
Kocak, Neu, Valko and Munos: "Efficient learning by implicit exploration in bandit problems with side observations". In NIPS 2014.
Neu: "Explore no more: Improved high-probability regret bounds for non-stochastic bandits". In NIPS 2015.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper studies the contextual bandit problem in a setting where the contexts are i.i.d. and the rewards are arbitrary (possibly adversarial). The authors present an algorithm that attains T^{3/4} regret with a logarithmic dependence on the number of benchmark policies, and which is computationally efficient given an oracle for the corresponding batch ERM problem. The authors also discuss extensions that allow (multiplicative-)approximation algorithms for ERM, and that deal with adversarial contexts.

Clarity - Justification:

The paper clearly motivates the problem, discusses the relevant previous work, and states the main contributions. On the other hand, the technical sections are much harder to follow:

* The authors do not provide any intuition or explanation of why their proposed approach works, nor do they give a proof sketch or an overview of the main result; the proofs themselves are very technical and quite hard to follow.
* Some preliminaries on relaxations and the idea behind them are warranted to make the paper readable.
* The "Setup" section mainly discusses notation and a few preliminaries, and does not describe the setting in a self-contained way. The authors may wish to rearrange the material in a more coherent way.

Significance - Justification:

The authors consider the contextual bandit problem in (to my knowledge) a novel, very reasonable and clearly motivated setting, and the assumptions made are natural. The regret rate established is perhaps inferior to that exhibited in previous work, but it is proven under much weaker assumptions and via a novel efficient algorithm that makes non-trivial use of unlabeled data.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The results established in the paper are, to the best of my knowledge, novel and make substantial progress in a highly active line of research. That said, the current state of the paper did not allow me to verify the correctness of the main results, nor to appreciate the techniques used to prove them (hence the lower score and confidence). I think the authors have to provide a proof sketch of their main theorem (Thm 2) in the body of the paper, or at least give some intuition or an informal overview of the "magic" that enables their results, as the proofs themselves are quite involved and technical.

I also have a more specific concern regarding the technical writing. The definitions of an admissible relaxation and an admissible strategy, and consequently the proofs of Lem 1 and Thm 2, do not seem to take into account the internal randomization involved in the choice of q_t. On top of that, due to the bandit feedback, q_t also depends on the previous randomized decisions made by the player (unlike in the full-feedback case), and this dependence is glossed over already in the basic definitions. This is particularly unfortunate because the novelty of the algorithm and one of the main points of the paper is the randomized choice of q_t based on hallucination of future rounds (i.e., sampling of unlabeled data), and this randomization should be treated carefully. While I believe these issues do not translate into any serious problem in the analysis, they certainly made it harder for me to follow the proofs and verify the correctness of the approach.

Some specific comments:

* The definition of an admissible strategy did not make sense to me: it is defined as a strategy that "certifies the inequalities (5) and (6)", but those inequalities are independent of a specific strategy and are required to hold for *all* strategies.
* Is it obvious why a \delta-approximate ERM oracle inflicts only an O(n\delta) penalty on the cumulative regret (see lines 412-414)? It seems that the errors propagate back into the feedback and indirectly affect future decisions made by the algorithm.
* Line 102, "the sequence of costs ... can even be chosen adaptively and adversarially": to my understanding, the relaxation lemma (Lemma 1) is valid only for an oblivious choice of the c_t (otherwise, the c_t are random variables and should be treated accordingly). Do you have an extension of the lemma that applies to adaptive costs?
* Thm 2: please mention, perhaps in a corollary, that for finite classes F the "matrix Rademacher complexity" is bounded by \sqrt{n \log|F|}, and state the implied regret bound. This is, after all, the main result of the paper.
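  To spell out the calculation I have in mind (the exact form of the bound and its dependence on the number of actions are my own guesses; only the scaling in n and |F| is what I am asking to be stated):

      For a finite class F, Massart's lemma bounds the Rademacher average of n-dimensional
      vectors with entries in [-1, 1] as
          \hat{R}_n(F) \le \sqrt{2 n \log|F|},
      so if the bound of Thm 2 has the form  Reg_n \lesssim \sqrt{n \cdot \hat{R}_n(F)}  (up to
      factors depending on the number of actions), the finite-class case gives
          Reg_n \lesssim n^{3/4} (\log|F|)^{1/4},
      i.e., the T^{3/4} rate with a logarithmic dependence on the number of policies.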
* Definition 1: it might be helpful to introduce some special notation for the ERM oracle, so that it is clear where it is being used in the algorithm.
* The approach of reducing online learning to batch/ERM by hallucinating future rounds (as well as the resulting T^{3/4} bound) is strikingly similar to the ideas of Kakade & Kalai, "From Batch to Transductive Online Learning", and I think the latter paper should be cited here.

Additional comments/typos:

* Line 270: "the expectation of the ERM objective"
* Line 288: "partial-information"
* Eq (5): c_t => c_t(\hat{y}_t)
* Line 455: "if define regret" => "even if we define the regret" ?
* Line 870: the subscripts of E and sup on the RHS should be n and not t.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper presents a novel ERM-based algorithm for the contextual bandit problem with iid contexts but adversarial rewards, in a transductive setting where the contexts are known beforehand or can be drawn from a source. The algorithm is based on relaxations (Rakhlin et al., 2012) and random playout (Rakhlin & Sridharan, 2015), and requires O(n) ERM calls in total over n rounds, with regret O(n^{3/4}). The paper also discusses several extensions, such as approximation algorithms for the ERM, data-dependent policy classes, etc.

Clarity - Justification:

In general the paper is well written, except for a few places that are rather vague, especially in Sec 6. I will give some examples in "Detailed comments".

Significance - Justification:

The contextual bandit problem is an important online learning problem with many applications. The algorithm presented in this work is quite simple and the analysis is also relatively concise compared to previous work, which improves the current understanding and, in my opinion, opens up new directions for this problem.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

1. Besides the weaker regret bound, another main drawback of this work compared to Agarwal et al., 2014 (which I don't think is actually mentioned) is the larger number of ERM calls, i.e. O(n) compared to O(\sqrt{n}). Is there any way to improve this?
2. The random playout idea is also studied by Cesa-Bianchi and Shamir, "Efficient Transductive Online Learning via Randomized Rounding", 2013.
3. Further discussion of the regret bound in Theorem 2 is needed. For example, when F is finite, why is the Rademacher term of order O(\sqrt{n})? Same comment for the bound in Lemma 5.
4. For the optimization-based relaxations, it is not clear how to make them practical when F is not the set of all possible labelings, which is the most interesting case.
5. Line 736, the statement "If a problem is online learnable in the full information…": more precisely, it should be something like "if a problem is relaxation-based online learnable…"?
6. Some typos:
   1) Line 270: extra "a" in "nothing but a (negative of) the…". Also, to be precise, it should be the "average" ERM objective with the random matrix.
   2) Line 455: "if define" -> "if we define"?
   3) Line 555: "for the matrix representation the corresponding f", missing an "of"?

=====