Poster in Workshop: Models of Human Feedback for AI Alignment
Adversarial Multi-dueling Bandits
Pratik Gajane
Abstract:
We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select m ≥ 2 arms at each round and observes as feedback the identity of the most preferred arm, based on an arbitrary preference matrix, possibly chosen adversarially. We introduce a novel algorithm, MIDEX, to learn from such preference feedback that is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative T-round regret of MIDEX compared to a Borda winner from a set of K arms is upper bounded by O((K log K)^{1/3} T^{2/3}). Moreover, we prove a lower bound of Ω(K^{1/3} T^{2/3}) for the expected regret in this setting, which demonstrates that our proposed algorithm is near-optimal.
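The setting described above can be illustrated with a minimal sketch. The preference matrix P, the subset size m, and the winner-sampling rule below are all illustrative assumptions for demonstration, not the paper's actual model or algorithm: a subset of m arms is drawn, a single most-preferred arm is reported as feedback, and the Borda winner is the arm with the highest average pairwise win probability.

```python
import numpy as np

rng = np.random.default_rng(0)

K, m = 5, 3  # K arms; the learner selects m >= 2 arms per round (illustrative sizes)

# Hypothetical preference matrix: P[i, j] = probability that arm i is preferred to arm j.
P = rng.uniform(0.0, 1.0, size=(K, K))
P = (P + (1.0 - P.T)) / 2.0   # enforce the consistency constraint P[i, j] + P[j, i] = 1
np.fill_diagonal(P, 0.5)

def borda_score(P, i):
    """Borda score of arm i: average probability of beating a uniformly random other arm."""
    K = P.shape[0]
    return sum(P[i, j] for j in range(K) if j != i) / (K - 1)

# The Borda winner is the comparator against which regret is measured.
borda_winner = max(range(K), key=lambda i: borda_score(P, i))

def winner_feedback(P, subset, rng):
    """Sample the identity of the most preferred arm in the chosen subset.

    This sampling rule (win probability proportional to an arm's total pairwise
    preference over the other chosen arms) is an illustrative stand-in for the
    paper's pairwise-subset choice model.
    """
    scores = np.array([sum(P[i, j] for j in subset if j != i) for i in subset])
    probs = scores / scores.sum()
    return subset[rng.choice(len(subset), p=probs)]

# One round of interaction: pick m arms, observe only the winner's identity.
subset = list(rng.choice(K, size=m, replace=False))
winner = winner_feedback(P, subset, rng)
print(f"Borda winner: arm {borda_winner}; round winner: arm {winner}")
```

Note that the Borda scores of all K arms always sum to K/2 under the consistency constraint, which is a quick sanity check on any candidate preference matrix.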