Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Models of Human Feedback for AI Alignment

Adversarial Multi-dueling Bandits

Pratik Gajane

[ ] [ Project Page ]
Fri 26 Jul 8 a.m. PDT — 8 a.m. PDT

Abstract: We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm, based on an arbitrary preference matrix, possibly chosen adversarially. We introduce a novel algorithm, MIDEX, to learn from such preference feedback that is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MIDEX compared to a Borda-winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ for the expected regret in this setting which demonstrates that our proposed algorithm is near-optimal.

Chat is not available.