
Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms
MohammadJavad Azizi · Thang Duong · Yasin Abbasi-Yadkori · Claire Vernade · András György · Mohammad Ghavamzadeh
We study a sequential decision problem where the learner faces a sequence of $K$-armed stochastic bandit tasks. The tasks may be designed by an adversary, but the adversary is constrained to choose the optimal arm of each task from a smaller (but unknown) subset of $M$ arms. The task boundaries might be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting). We design an algorithm based on a reduction to bandit submodular maximization, and show that, in the regime of a large number of tasks and a small number of optimal arms, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M \tau}+N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2}\sqrt{M K \tau})$ regret.
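To make the setting concrete, here is a minimal simulation sketch of the problem and of the simple baseline the abstract compares against: a sequence of $N$ Bernoulli bandit tasks of length $\tau$, each with $K$ arms, where every task's optimal arm is drawn from a hidden subset of $M$ arms, and standard UCB1 is restarted from scratch on each task. All numerical values (arm means, the constants `K, M, N, tau`) are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper):
# K arms, hidden subset of M optimal arms, N tasks, task length tau.
K, M, N, tau = 10, 3, 50, 200

# The adversary's constraint: each task's optimal arm lies in a
# fixed but unknown subset of M arms.
optimal_subset = rng.choice(K, size=M, replace=False)

def make_task():
    """Sample a Bernoulli bandit task whose best arm is in the hidden subset."""
    means = rng.uniform(0.0, 0.4, size=K)
    best = rng.choice(optimal_subset)
    means[best] = rng.uniform(0.6, 0.9)  # make the chosen arm clearly optimal
    return means

def ucb_task_regret(means, horizon):
    """Pseudo-regret of standard UCB1 run from scratch on a single task."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(horizon):
        if t < K:
            arm = t  # initialize: pull each arm once
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += means.max() - means[arm]
    return regret

# Baseline that ignores the shared structure: restart UCB1 on every task.
total = sum(ucb_task_regret(make_task(), tau) for _ in range(N))
print(f"baseline total pseudo-regret over {N} tasks: {total:.1f}")
```

Because this baseline relearns all $K$ arms in every task, its total regret scales with $K$ per task; the paper's algorithms instead aim to identify the $M$-arm subset across tasks and pay the $K$-dependent cost far less often.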

Author Information

MohammadJavad Azizi (University of Southern California)
Thang Duong (VinAI Research, VinUniversity)
Yasin Abbasi-Yadkori (DeepMind)
Claire Vernade (DeepMind)
András György (DeepMind)
Mohammad Ghavamzadeh (Google Research)
