Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits

Aadirupa Saha · Shubham Gupta

Hall E #1007

Keywords: [ MISC: Online Learning, Active Learning and Bandits ] [ T: Online Learning and Bandits ]

Abstract: We study the problem of \emph{dynamic regret minimization} in $K$-armed Dueling Bandits under non-stationary or time-varying preferences. This is an online learning setup where the agent chooses a pair of items at each round and observes only a relative binary `win-loss' feedback for this pair sampled from an underlying preference matrix at that round. We first study the problem of static-regret minimization for adversarial preference sequences and design an efficient algorithm with $O(\sqrt{KT})$ regret bound. We next use similar algorithmic ideas to propose an efficient and provably optimal algorithm for dynamic-regret minimization under two notions of non-stationarities. In particular, we show $\tO(\sqrt{SKT})$ and $\tO({V_T^{1/3}K^{1/3}T^{2/3}})$ dynamic-regret guarantees, respectively, with $S$ being the total number of `effective-switches' in the underlying preference relations and $V_T$ being a measure of `continuous-variation' non-stationarity. These rates are provably optimal as justified with matching lower bound guarantees. Moreover, our proposed algorithms are flexible as they can be easily `blackboxed' to yield dynamic regret guarantees for other notions of dueling bandits regret, including condorcet regret, best-response bounds, and Borda regret. The complexity of these problems have not been studied prior to this work despite the practicality of non-stationary environments. Our results are corroborated with extensive simulations.

Chat is not available.