Paper ID: 995
Title: Efficient Algorithms for Adversarial Contextual Learning

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper provides the first oracle-efficient, sublinear-regret algorithms, based on the Follow-the-Perturbed-Leader family, for adversarial versions of the contextual bandit problem. Two settings are analyzed: 1) the transductive setting, in which the learner knows the set of contexts a priori; 2) the small-separator setting, in which there exists a small set of contexts such that any two policies behave differently on at least one context in the set. Several extensions are provided, such as switching regret and efficient learning with predictable sequences.

Clarity - Justification:
The paper is very clear and easy to follow. The proofs are sound and seem correct, although I was only able to check the proofs on the first few pages of the supplementary material (up to B.4).

Significance - Justification:
This paper contains interesting results, giving for the first time efficient algorithms for the general contextual online learning problem in the adversarial setting, including adversarial contextual semi-bandit linear optimization. I think the contribution to the theory of online learning is strong.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I liked the paper and found the results interesting. The algorithm is simple to implement, as it only requires sampling d additional random observations and access to an optimization oracle that can efficiently solve the corresponding offline optimization problem. Such an oracle is often available for the combinatorial problems considered in the literature. Unfortunately, not being an expert in the field of semi-bandit optimization, I cannot fully judge the novelty of the results.

To compare the bounds with the special case of non-contextual (d=1) combinatorial optimization, it would be good to re-express N in terms of the size of the action set. In the worst case, N = O(m log K), and then the bound in Theorem 2, point (2), becomes O(m^{7/4} sqrt(T log K)), a factor of m^{1/4} worse than that of Neu and Bartok (2013), which is also based on generating exponentially-distributed noise variables. A similar factor of m^{1/4} is lost compared to their bound for the context-free semi-bandit combinatorial setting. Of course, the bounds of Neu and Bartok (2013) are context-free and hence less general, but I am curious how this additional factor of m^{1/4} appears in the bound and whether the analysis could be improved.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes new learning algorithms for contextual bandit problems that guarantee sublinear regret in the case of non-stochastic loss sequences, given that the learner has prior access to either (i) the complete set of contexts to occur during learning (called the transductive setting) or (ii) a small set of contexts that can be used to tell policies apart (the "small-separator" setting). The algorithms guarantee regret bounds of O(T^{3/4}) in the transductive setting and O(T^{2/3} d^{3/4}) in the small-separator setting, where T is the number of rounds and d is the size of the set of "separator" contexts. The proposed algorithms are based on the "Follow-the-Perturbed-Leader" (FTPL) principle and can be implemented efficiently given access to an optimization oracle for the underlying offline optimization problem.
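To make the mechanism concrete, the following is a minimal sketch of how I understand the full-information, transductive variant. The uniform noise on [-c,c], the brute-force stand-in for the ERM oracle, and all names below are my own illustration, not the authors' algorithm or code.

import random

def erm_oracle(policies, samples):
    # Stand-in for the offline optimization oracle: return the policy with the
    # smallest cumulative loss on the given (context, loss-vector) samples.
    return min(policies, key=lambda pi: sum(loss[pi[x]] for x, loss in samples))

def ftpl_transductive(contexts, policies, n_actions, get_loss, T, c=1.0):
    # contexts: the context set known in advance (transductive assumption).
    # policies: dicts mapping each context to an action index; the explicit list
    #           stands in for the policy class searched implicitly by the oracle.
    # get_loss(t, x): loss vector over actions at round t for context x,
    #           revealed after the learner commits (full-information feedback).
    history, total = [], 0.0
    for t in range(T):
        # Perturbation: one fake sample per known context with signed random
        # losses, regenerated afresh at every round.
        fake = [(x, [random.uniform(-c, c) for _ in range(n_actions)])
                for x in contexts]
        pi = erm_oracle(policies, history + fake)  # follow the perturbed leader
        x_t = random.choice(contexts)              # stand-in for the adversary's context
        loss_t = get_loss(t, x_t)
        total += loss_t[pi[x_t]]                   # learner plays pi(x_t)
        history.append((x_t, loss_t))
    return total

In the semi-bandit case, the exact loss vectors would have to be replaced by estimates built only from the observed coordinates (e.g., via the resampling scheme of Neu and Bartok (2013) referred to in the detailed comments below), which is where the additional oracle calls come from.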
Clarity - Justification:
The paper is written very well and the technical quality is very good for the most part.

Significance - Justification:
The general problem of efficient learning in contextual bandit problems is arguably very important, and the authors' approach is novel and interesting. As far as I know, the new algorithms are indeed among the first ones for non-stochastic contextual bandits based on polytime reductions to empirical risk minimization. The regret bounds themselves are not very strong, but the novelty of the approach makes up for this.

I have to say, though, that I found the assumptions made by the authors a bit surprising and possibly quite impractical. In particular, I am not fully convinced that the "transductive setting" is really meaningful for online learning scenarios, or contextual bandits in particular. For applications in online advertising, this assumption essentially requires the learner to know the features of *all* future users to be encountered, which is very restrictive given that in such problems even the time horizon T is typically unknown to the learner. Could the authors give a concrete (at least semi-)practical example where this assumption could be verified? The "small-separator" setting seems more practical, although identifying a small separator for a policy class seems like a very hard computational problem on its own.

The authors also offer some additional results concerning competing with the best sequence of policies and path-length bounds. In particular, the authors claim that their approach leads to the first efficient algorithm with strong tracking-regret guarantees. However, I don't think that this is true: Gyorgy et al. (2012) also seem to achieve similar regret guarantees while being computationally more efficient. As for the path-length bounds, they don't add much value to the paper, but I don't mind their inclusion.

Another result claimed by the authors is that competing with VC classes in a non-transductive setting is, in a sense, hard. This result is not new: it has been known for quite some time that the VC dimension is not the appropriate measure for quantifying the hardness of online learnability (see, e.g., Ben-David, Pal and Shalev-Shwartz, COLT 2009). The hardness result proved here (Theorem 16) should be contrasted with the margin-error bounds of Ben-David et al. (their Corollary 6). The authors should also relate this result to the positive result of Beygelzimer et al. (2011).

These issues aside, I think that the paper is worth publishing, as it brings a new flavor into contextual bandit learning and has the potential to inspire future work. I hope that the authors can address the above issues in the final version of their manuscript.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
146: "ours runs in polynomial time"---polynomial in what? Please specify.
176: The work of Awerbuch and Kleinberg does not seem to be very relevant here. Either remove this reference or include other works concerning combinatorial bandits.
207: "cumulative utility" -> "cumulative loss" or "cumulative cost"
260: The full-information FTPL algorithm described here seems to be closely related to the random-playout algorithm of Rakhlin & Sridharan (NIPS 2012). Can you discuss this relationship?
386: Maybe it should be K^d instead of d^K? Either way, |\Pi| is not necessarily bounded by K^d or d^K.
It's probably better to say that, for the purpose of the analysis, the effective number of policies is at most K^d, while the actual policy class may be much larger.
418: This notation for the l-infinity norm is rather unusual (as it is not used in conjunction with the 1-norm as the primal norm).
434: Technically, this result may also apply to uniform distributions over [-c,c].
512: Is there a polytime optimization oracle for this policy class?
526: In this case, a more reasonable goal is to compete with the set of *all* linear classifiers. The approach presented here, however, does not seem to offer any meaningful regret guarantee against this class, as the data distribution may be such that all classifiers have a 50% error rate even though there is a perfect separator with 100% accuracy. It appears that the algorithm of Beygelzimer et al. (2011) manages to overcome this problem. Can the authors comment on this?
595: Is there any technical reason to sample J^t(j) independently for each j? The analysis of Neu and Bartok (2013) also works for dependent samples.
665: "the algorithm requires mL oracle calls"---only in the worst case, as the resampling procedure may terminate earlier.
703: "I(t) is a selector which maps a time-step t to a policy \pi"---maybe it maps time indices to policy *indices* rather than policies?
741: What is "arginf"?
1062: Missing parentheses.
1044: "is bounded is bounded"
1240: "\pi(x)(j)" -> "\pi(x)(j)=1"
1435: Note that the proof of this lemma borrows several elements from Neu and Bartok (2013), e.g., the definition of the sparse loss vector \ell_\pi and the associated distribution p_\pi, or the inequality in line 1564.
1449: The overloading of z makes this proof a bit difficult to follow.
1571: Note that this proof essentially repeats the same steps as the proof of Lemma 9.
1700: "Using the fact that expected estimates are upper bounded by true losses"---I don't see where this was used. Maybe you wanted to replace E[\hat\ell] by \ell in the first term on the right-hand side?
1765: What "norm inequality" is used here? It simply looks like Jensen's inequality.
1812: Same mistake as in line 1700. But the inequality above (line 1807) seems to be stronger anyway.
1879-1890: You can save all these calculations by using E[J^2] \le L E[J] (which holds since J \le L almost surely).
1899: "The second lemma is an analogue of our multiplicative stability lemma"---it is actually an analogue of the additive stability Lemma 9.
2127: What is the definition of * here?
2206: What is d here?

References:
Ben-David, Pál and Shalev-Shwartz: Agnostic Online Learning. In COLT 2009.
Beygelzimer, Langford, Li, Reyzin and Schapire: Contextual Bandit Algorithms with Supervised Learning Guarantees. In AISTATS 2011.
Gyorgy, Linder and Lugosi: Efficient Tracking of Large Classes of Experts. In IEEE Transactions on Information Theory, 2012.
Rakhlin, Shamir and Sridharan: Relax and Randomize: From Value to Algorithms. In NIPS 2012.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a new variant of Follow the Perturbed Leader and derives adversarial regret bounds for several contextual prediction settings, both for full-information and semi-bandit feedback. The novelty compared to previous work in these settings is that the new algorithm is computationally efficient even if the policy space is huge, assuming only that we have access to an efficient oracle that finds the best fixed policy for a given sequence of losses.
This is in contrast to earlier work, where computational costs may scale linearly with the number of policies, or where very specific assumptions are made about the structure of the policy space. As a drawback, the regret bounds in the current paper are often somewhat weaker than those obtained by computationally inefficient methods.

Clarity - Justification:
The paper proceeds in a clear and logical manner, explaining the setting, introducing the algorithm and main results, and studying their implications in a variety of specific settings. The text is easy to follow. However, I did not read the supplementary material, which includes all the proofs.

Significance - Justification:
The results in the paper achieve a nice level of generality. The ways used to limit the space of policies and contexts (the transductive and small-separator settings) seem interesting, being a kind of compromise between simply counting the number of policies and making domain-specific assumptions.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Do the results generalise to the situation where the oracle performs only approximate optimisation (with some error guarantee)?

You mention that it is essential that your algorithm can also make negative perturbations. As you say, negative weights are not a problem for shortest paths in DAGs, but are there other cases where this might be a serious issue for the oracle?

Your algorithm for the switching scenario makes O(T^2) oracle calls in total. This is reasonable for this type of result, but you might still mention it when you summarise the results in the introduction, instead of just saying the algorithm is efficient.

=====