Abstract:
Learning with expert advice and multi-armed bandits are two classic online decision problems which differ in how information is observed in each round of the game. We study a family of problems interpolating the two. For a vector $\mathbf{m}=(m_1,\dots,m_K)\in\mathbb{N}^K$, an instance of $\mathbf{m}$-MAB indicates that the arms are partitioned into $K$ groups and the $k$-th group contains $m_k$ arms. Once an arm is pulled, the losses of all arms in the same group are observed. We prove tight minimax regret bounds for $\mathbf{m}$-MAB and design an optimal PAC algorithm for its pure exploration version, $\mathbf{m}$-BAI, where the goal is to identify the arm with minimum loss using as few pulls as possible. We show that the minimax regret of $\mathbf{m}$-MAB is $\Theta\left(\sqrt{T\sum_{k=1}^{K}\log(m_k+1)}\right)$ and that the minimum number of pulls for an $(\varepsilon,0.05)$-PAC algorithm for $\mathbf{m}$-BAI is $\Theta\left(\frac{1}{\varepsilon^2}\cdot\sum_{k=1}^{K}\log(m_k+1)\right)$. Both our upper and lower bounds for $\mathbf{m}$-MAB can be extended to a more general setting, namely bandits with graph feedback, in terms of the *clique cover* and related graph parameters. As consequences, we obtain tight minimax regret bounds for several families of feedback graphs.
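To see how the bound interpolates the two classic settings, one can specialize it to the two extreme choices of $\mathbf{m}$ (a brief illustrative check; here $n$ denotes the total number of arms, a symbol introduced only for this example). A single group of size $n$ gives full information, while $n$ singleton groups give standard bandit feedback:
\[
  \mathbf{m}=(n):\quad
    \Theta\!\left(\sqrt{T\log(n+1)}\right) = \Theta\!\left(\sqrt{T\log n}\right)
    \quad \text{(learning with expert advice)},
\]
\[
  \mathbf{m}=(\underbrace{1,\dots,1}_{n}):\quad
    \Theta\!\left(\sqrt{T\,n\log 2}\right) = \Theta\!\left(\sqrt{Tn}\right)
    \quad \text{(multi-armed bandits)},
\]
recovering the classic $\Theta(\sqrt{T\log n})$ rate for experts and the $\Theta(\sqrt{Tn})$ rate for bandits.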