

Poster
in
Workshop: Models of Human Feedback for AI Alignment

Adversarial Multi-dueling Bandits

Pratik Gajane


Abstract: We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select m ≥ 2 arms at each round and observes as feedback the identity of the most preferred arm, based on an arbitrary preference matrix, possibly chosen adversarially. We introduce a novel algorithm, MiDEX, to learn from such preference feedback that is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative T-round regret of MiDEX compared to a Borda-winner from a set of K arms is upper bounded by O((K log K)^{1/3} T^{2/3}). Moreover, we prove a lower bound of Ω(K^{1/3} T^{2/3}) for the expected regret in this setting, which demonstrates that our proposed algorithm is near-optimal.
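The feedback model described in the abstract (the learner picks m ≥ 2 of K arms and observes only the identity of the round's most preferred arm, driven by a preference matrix) can be sketched in a small simulation. This is an illustrative assumption, not the paper's MiDEX algorithm: the preference matrix here is random, and the winner rule assumes one plausible form of a pairwise-subset choice model in which an arm's win probability within the chosen subset is proportional to the sum of its pairwise preference probabilities against the other chosen arms.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m = 5, 3  # K arms in total; the learner selects m >= 2 arms per round

# Hypothetical preference matrix: P[i, j] = Pr(arm i is preferred to arm j).
# In the adversarial setting this matrix may change arbitrarily per round;
# here a single fixed random matrix is used purely for illustration.
P = rng.uniform(0.0, 1.0, size=(K, K))
P = (P + (1.0 - P.T)) / 2.0   # enforce the consistency P[i, j] + P[j, i] = 1
np.fill_diagonal(P, 0.5)

def winner_feedback(subset, P, rng):
    """Sample the most-preferred arm among the selected subset.

    Assumed (illustrative) pairwise-subset choice model:
    Pr(i wins | subset) is proportional to sum_{j in subset, j != i} P[i, j].
    """
    subset = np.asarray(subset)
    scores = np.array([P[i, subset[subset != i]].sum() for i in subset])
    probs = scores / scores.sum()
    return int(rng.choice(subset, p=probs))

chosen = rng.choice(K, size=m, replace=False)  # one round's selection
win = winner_feedback(chosen, P, rng)          # only the winner's identity is observed
```

The Borda score of arm i under a fixed P is its average probability of beating a uniformly random opponent, `P[i].sum() / (K - 1)` after excluding the diagonal; the regret in the abstract is measured against the arm maximizing this score.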
