Spotlight
Adversarial Dueling Bandits
Aadirupa Saha · Tomer Koren · Yishay Mansour
Abstract:
We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary `win-loss' feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially.
Our main result is an algorithm whose TT-round regret compared to the \emph{Borda-winner} from a set of K items is ˜O(K1/3T2/3), as well as a matching Ω(K1/3T2/3) lower bound. We also prove a similar high probability regret bound.
We further consider a simpler \emph{fixed-gap} adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an ˜O((K/Δ2)logT) regret algorithm, where Δ is the gap in Borda scores between the best item and all other items, and show a lower bound of Ω(K/Δ2) indicating that our dependence on the main problem parameters K and Δ is tight (up to logarithmic factors). Finally, we corroborate the theoretical results with empirical evaluations.
Chat is not available.