

Talk in Workshop: Theoretical Foundations of Reinforcement Learning

Short Talk 1 - Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality

Kwang-Sung Jun


Abstract:

We study regret minimization in stochastic structured bandits. The fact that the popular optimistic algorithms do not achieve asymptotic instance-dependent regret optimality has recently drawn researchers' attention. On the other hand, it is known that one can achieve bounded regret (i.e., regret that does not grow indefinitely with the time horizon $n$) in certain instances. Unfortunately, existing asymptotically optimal algorithms rely on forced sampling that introduces an $\omega(1)$ term w.r.t. the time horizon $n$ in their regret, failing to adapt to the "easiness" of the instance. In this paper, we focus on the finite hypothesis class case and ask whether one can achieve asymptotic optimality while enjoying bounded regret whenever possible. We provide a positive answer via a new algorithm called CRush Optimism with Pessimism (CROP). Our analysis shows that CROP achieves a constant-factor asymptotic optimality and, thanks to its forced-exploration-free design, adapts to bounded regret; moreover, its regret bound scales not with the number of arms but with an effective number of arms that we introduce. We also show that CROP can be exponentially better than existing algorithms in nonasymptotic regimes. Finally, we observe that even a clairvoyant oracle who plays according to the asymptotically optimal arm-pull scheme may suffer a linear worst-case regret, indicating that it may not be the end of optimism. We believe our work may inspire a new family of algorithms for bandits and reinforcement learning.

Kwang-Sung Jun, Chicheng Zhang
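
To make the setting concrete, below is a minimal sketch of a stochastic structured bandit with a finite hypothesis class, together with a plain optimistic (UCB-style) baseline that ignores the structure. The hypothesis vectors, reward model, and function names are illustrative assumptions for this sketch; it is not the CROP algorithm from the talk.

```python
import numpy as np

# Illustrative assumptions (not from the talk): 3 arms, a finite hypothesis
# class of candidate mean-reward vectors, and Gaussian rewards with unit variance.
# The true environment is one member of the hypothesis class.
HYPOTHESES = np.array([
    [0.9, 0.5, 0.1],   # hypothesis f1
    [0.1, 0.5, 0.9],   # hypothesis f2
    [0.5, 0.9, 0.1],   # hypothesis f3
])
TRUE_MEANS = HYPOTHESES[0]
K = len(TRUE_MEANS)
HORIZON = 5000
rng = np.random.default_rng(0)


def ucb_baseline(horizon):
    """Plain UCB over the arms, ignoring the hypothesis structure."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            arm = t - 1                                  # pull each arm once
        else:
            means = sums / counts
            bonus = np.sqrt(2.0 * np.log(t) / counts)    # optimism bonus
            arm = int(np.argmax(means + bonus))
        reward = TRUE_MEANS[arm] + rng.normal()
        counts[arm] += 1
        sums[arm] += reward
        regret += TRUE_MEANS.max() - TRUE_MEANS[arm]     # pseudo-regret
    return regret


print("UCB pseudo-regret over horizon:", round(ucb_baseline(HORIZON), 1))
```

A structure-aware algorithm can exploit the finite hypothesis class to stop exploring once the observations rule out all but one hypothesis, which is how bounded regret becomes possible on easy instances, whereas the unstructured optimistic baseline above keeps paying a logarithmic exploration cost over the whole horizon.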
