Paper ID: 383
Title: Online Learning with Feedback Graphs Without the Graphs
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper is about online learning with limited feedback. Each round, the feedback model is specified by a graph. The learner sees the quality of all actions connected to the action he chose. The graph may be different every round. Previous work looked at learning strategies where the graph is revealed in full. This work considers the case where the graphs are only partially revealed. A range of results (lower bound adversaries and upper bound algorithms) are presented. Both for stochastic and adversarial losses, and for reflexive and arbitrary graphs. A complete and pretty picture emerges.
Clarity - Justification: Very well structured. Lots of motivation and explanation. Well done.
Significance - Justification: The questions are natural follow-ups of earlier work on feedback graphs. Yet the paper presents a large and diverse set of results.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I have nothing to add about this paper. Well-motivated, well-written. An interesting breadth of results. Clear accept.
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper considers the problem of online learning with side observations specified by feedback graphs, following the model of Mannor and Shamir (2011). Existing results in this problem setting show that if the learner can observe the entire feedback graph, then it is possible to improve the standard regret guarantees of O(\sqrt{KT}) to O(\sqrt{\alpha T}), where K is the number of actions, T is the number of rounds and \alpha is the average independence number of the feedback graphs. In the current paper, the authors show that this improvement is not achievable in general if the learner only receives the side-observations themselves and cannot observe the full feedback graphs. However, the authors also show that the same improvement is achievable under the assumption that the losses are generated in an iid fashion. Finally, the authors also provide negative and positive results about the case when the learner receives strictly less feedback than bandit: here, learning in general is shown to be impossible, though the iid assumption on the losses is shown to enable sublinear regret.
Clarity - Justification: The paper is very enjoyable to read and the technical quality is solid.
Significance - Justification: The problem studied in this paper has gathered quite a bit of attention in recent years, and the current paper shows some interesting limitations of the framework. While these limitations were already apparent in the proofs in the known positive results, it is nice to see formal lower bounds that quantify the importance of observing the feedback graphs. The counterexamples used for proving the results are clean and intuitive, and the proof details are very elegant. The algorithm proposed for stochastic losses is also very intuitive and well-justified by the authors. Overall, I have very little to complain about. Perhaps one clarification about the settings themselves would be necessary: On several occasions, the authors emphasize that "stochastic losses" are what make the problem learnable. This is a bit misleading as the loss sequences in the lower bound constructions are also iid, with the key difference being that they are generated jointly with the observation systems. That is, it is really not the stochasticity of the losses that makes the problem learnable but the statistical independence of the losses from the feedback graphs (which is of course only well-defined if at least one of these objects are random). This issue, while mentioned in Section 2.2 should be made clear at an earlier stage in the paper. Also, this discussion naturally leads to another important question: do you expect the problem to be learnable when the losses are non-stochastic but the graphs are iid?
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): 423-427: This sentence is so long that it's quite difficult to follow, even though this is the most important detail of the proof. Please rewrite. 437: This statement holds for some specific setting of \epsilon, if I'm not mistaken. Consider stating the theorem for any \epsilon to be consistent with Lemma 5. 505: The symbols \hat{\mu}_r and \tilde{\mu}_r are sometimes difficult to distinguish with the naked eye. Consider introducing some notation that easier to read (e.g., \hat{\mu}_r^*?). 526: You probably mean that AlphaSample returns *at least* one sample? 576: "by the Cauchy-Schwartz inequality"---and also Jensen's? Or Cauchy-Schwartz applied to the inner product E[\sum_i a_i b_i]? 722: This is not Markov's inequality, just bounding X \ge I_{X > 0}. 840: "If it is left with exactly one action..."---This procedure is described in a very confusing way. Why does it give an observation probability of at least 1/2 to v? 1139-1146: This reasoning is rather difficult to follow. Perhaps it would be better to first say that you want to find a bound on x that guarantees that the concave quadratic function takes only negative values for x greater than this bound, and then say what this implies, and then turn this into a bound on P[\alpha(G)>\alpha]. Otherwise this argument is really elegant.
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper focuses on regret minimization where when an action is taken the rewards of all its neighbors are observed. The key point is that the graph that defines neighborhood is unknown and might change from time to time. Interestingly, there exists a sequence of graphs of bounded independence number (the key quantity when the graphs are observed) such that the lower bound on regret is of the order of (KT)^{1/2}, i.e., the rate of traditional multi-armed bandit. This answer a question raised after a recent line of work on bandit with graph feedbacks.
Clarity - Justification: The paper uses some notion from online learning and graph theory without defining them properly. But an interested reader can always check wikipedia (or any other book, but let's be honest) to get the relevant knowledge.
Significance - Justification: The contribution is pretty interesting as it solves an open-question on bandit with graph feedback. Do you need to know the graph to take advantage and get improved rates of convergence ? Yes. And this is not trivial. If the graph is known, then one can compute the probability of observing the reward of the action chosen (and then use important sampling techniques) If the graph is unknown, of course, this probability cannot be assessed, but it still could have been possible to get better rates. Unfortunately, it is not the case. The stochastic case is less surprising, and is rather clear that using a variant of elimination-based algorithm (that samples each remaining arm once before making any decision) would still benefit from the graph-based observations. Thus, if all graphs have an independent number of \alpha, getting a sample from each arm will required (roughly) \alpha actions, leading to a regret of the order of \alpha \log(T)/\Delta, instead of d \log(t)/\Delta
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): My comments were already detailed in the previous sections. The main result (lower bound in the adversarial case) is cute, the other much less impressive (and the proofs of them are really classic). Overall, I enjoyed reading this paper.
=====