We thank the reviewers for their insightful feedback and address the main concerns below. We will incorporate further feedback in the final version.
Rev 1: Regarding the applicability of transductive online learning, this problem setting has been widely studied in the learning theory community, as it is a natural online analogue of Vapnik's original transductive model (cf. Ben-David et al. '97, Kakade-Kalai '05). Developing oracle-efficient algorithms for the non-transductive case is certainly an interesting avenue for future work, but we believe that our work, by focusing on the transductive setting, makes significant progress toward this goal.
One concrete application of the transductive setting is personalized content recommendation for subscription users. Such scenarios arise, for instance, in personalized news or movie services like MSN and Netflix, where contextual bandits are deployed. In these settings, the set of users from the previous day is a good proxy for the set of users that will arrive the next day, so this can be reasonably modeled as a transductive online problem. Moreover, our algorithm is robust to the arrival of new, previously unknown contexts: it incurs only an extra regret of 1 for each such arrival. Thus, if the number of times we observe a context we did not account for grows sublinearly with the time horizon, our algorithm still achieves sublinear regret. We can add this comment to the final version. We also note that our results can easily be adapted to an unknown horizon T by a standard doubling trick, at the cost of a log(T) multiplicative increase in the regret.
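To make the doubling-trick remark concrete, the following is a minimal sketch (with illustrative names, not taken from the paper) of the epoch schedule: the fixed-horizon algorithm is restarted with horizon guesses 1, 2, 4, ..., and since only O(log T) restarts occur, the regret increases by at most a log(T) multiplicative factor.

```python
def doubling_epochs(T):
    """Rounds (start, end) covered by each epoch when the horizon T is
    unknown: the fixed-horizon algorithm is restarted with guesses that
    double (1, 2, 4, ...), and the last epoch is truncated at T."""
    epochs, t, k = [], 1, 0
    while t <= T:
        length = 2 ** k  # horizon guess for epoch k
        epochs.append((t, min(t + length - 1, T)))
        t += length
        k += 1
    return epochs

# The number of epochs is O(log T), which is where the log(T)
# multiplicative overhead in the regret comes from.
```

For example, `doubling_epochs(10)` yields the epochs `[(1, 1), (2, 3), (4, 7), (8, 10)]`, and a horizon of 1000 is covered by only 10 epochs.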
Rev 1: The paper of Gyorgy et al. uses an efficient online learning algorithm as a black box, but as far as we are aware, no efficient online learning algorithm for this problem was known prior to our result. Thus our result, achieving sublinear tracking policy regret in an optimization-oracle-efficient manner, is not implied by the results of Gyorgy et al., as no oracle-efficient online learning algorithm existed for the contextual online learning problem. We agree, however, that the result for the routing problem is implied by Gyorgy et al. In principle, we could invoke the approach in that paper to extend our fixed-action regret bound to switching regret. Nevertheless, we believe that our more explicit construction is also of interest due to its simplicity and potential for extension: it illustrates how easily time can be incorporated as part of the context, and therefore has the potential to lead to efficient algorithms for other time-varying policies, provided the offline oracle for the implied optimization problem can be computed efficiently (similar to how we show this in Lemma 5). For these two reasons, we think the results here remain a contribution to the literature. We will remove the claim that we are the first to give efficient tracking regret results and cite Gyorgy et al. appropriately.
Rev 1: Regarding Thm 16, we agree this result is well known, which is why we deferred it to the supplementary material. We felt that presenting it for our specific formulation of the problem would make the exposition more complete for the reader. We will add appropriate citations and discussion. Note that the VC result of Beygelzimer et al. applies only to the i.i.d. setting, so it does not contradict our lower bound.
Rev 1: Regarding discretizing linear separators, as we prove in Thm 16, one cannot achieve sublinear regret when competing with the set of all linear classifiers in a fully adversarial non-transductive environment, which is the setting considered in Point 2 of Example 4. Thus a natural and tractable goal is to compete with a discretization of linear separators, as we do in the example.
Rev 1: Regarding the correctness of Thm 16, the theorem holds regardless of whether the learning algorithm operates over a discretization or not. What matters is that the comparator class is not discretized.
Rev 2: Handling an approximate oracle is an interesting direction for future work. In our current approach, if the offline oracle incurs an epsilon additive error, then this adds an extra epsilon*T term to the regret. The inability to handle approximate oracles in a better way is also present in prior work on the contextual bandit problem in the i.i.d. setting.
Rev 4: The discrepancy in the bound in comparison with Neu and Bartok can be resolved by a more careful analysis of the error term in the non-contextual case. On line 1251, our proof chooses \lambda = \epsilon/(2\sqrt{dm}), which is a suboptimal choice for minimizing the preceding expression. A better choice is \lambda = \min\{\epsilon/2, \epsilon\sqrt{\log(N)/(2dm)}\}. This setting resolves the discrepancy, and we can incorporate it into our final result. We presented the simpler bound primarily for ease of presentation.