Follow-the-Perturbed-Leader for Decoupled Bandits: Best-of-Both-Worlds and Practicality
Chaiwon Kim ⋅ Jongyeong Lee ⋅ Min-hwan Oh
Abstract
We study the decoupled multi-armed bandit problem, where the learner selects one arm for exploration and one arm for exploitation separately at each round. In this setting, the loss of the explored arm is observed but not incurred, whereas the loss of the exploited arm is incurred but not observed. We propose an efficient Follow-the-Perturbed-Leader (FTPL) policy that achieves a Best-of-Both-Worlds (BOBW) guarantee: constant regret in the stochastic regime and optimal $\mathcal{O}(\sqrt{KT})$ regret in the adversarial regime. A key feature of our method is that it completely avoids both the convex optimization required by prior BOBW policies and the resampling procedures typically used in FTPL bandit policies. This allows FTPL to fully realize its computational advantage, yielding substantial reductions in computational cost. We empirically confirm that our policy not only improves runtime but also achieves superior regret in both regimes.
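The abstract does not spell out the policy's update rule, so the following is a minimal, illustrative Python sketch of the decoupled interaction protocol with a generic Fréchet-perturbed FTPL rule. The perturbation shape, learning-rate schedule, and loss estimator below are placeholder assumptions, not the paper's method; in particular, the Monte Carlo probability estimate at the end is exactly the kind of resampling step the proposed policy is designed to avoid, and is included only to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 5, 5000
true_means = rng.uniform(0.2, 0.8, size=K)  # hypothetical Bernoulli loss means (stochastic regime)
L_hat = np.zeros(K)                         # cumulative importance-weighted loss estimates

def frechet(size, alpha=2.0):
    """Frechet(alpha) perturbations via inverse-CDF sampling.
    alpha = 2 is a shape commonly used in FTPL analyses; an assumption here."""
    u = rng.uniform(size=size)
    return (-np.log(u)) ** (-1.0 / alpha)

incurred = 0.0
for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)                      # decreasing learning rate (illustrative choice)
    z = frechet(K)
    explore = int(np.argmin(eta * L_hat - z))   # perturbed leader: arm to explore
    exploit = int(np.argmin(L_hat))             # unperturbed leader: arm to exploit

    loss_obs = float(rng.random() < true_means[explore])  # observed but not incurred
    incurred += true_means[exploit]                       # incurred but not observed

    # Placeholder: Monte Carlo (resampling) estimate of the probability that
    # `explore` is the perturbed leader -- the paper's policy avoids exactly
    # this step; it appears here only to make the sketch runnable.
    m = 128
    draws = np.argmin(eta * L_hat - frechet((m, K)), axis=1)
    p_hat = max(float(np.mean(draws == explore)), 1.0 / m)  # crude floor on the weight
    L_hat[explore] += loss_obs / p_hat

print("average incurred loss:", incurred / T, "| best arm mean:", true_means.min())
```

Because the explored arm's loss is not incurred, the exploitation choice can track the unperturbed leader aggressively while exploration is handled entirely by the perturbation, which is one way to read the abstract's claim of constant stochastic regret; the actual mechanism is given in the paper itself.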