Understanding the Gaps in Satisficing Bandits
Chloé Rouyer ⋅ Ronald Ortner ⋅ Peter Auer
Abstract
We study a variant of the stochastic multi-armed bandit problem in which the learner aims to identify and play an arbitrary arm whose mean reward exceeds a known satisficing threshold $S$, rather than optimizing against the best arm. Prior work has shown that when such a satisficing arm exists, time-independent bounds on the satisficing regret are achievable, but these guarantees deteriorate when an arm lies close to the threshold. We focus on instances in which the excess gap $\Delta_*$ (the gap between the best arm and the threshold) is small relative to the suboptimality gaps $\Delta_i$, a regime that exposes this limitation. To capture this challenge, we introduce a refined notion of regret and propose a new algorithm, uncertain-UCB, which achieves *satisficing* pseudo-regret of $O\left(\sum_{i: \Delta_i > \Delta_*} \frac{\log(K/\Delta_*)}{\Delta_i}\right)$, while recovering standard pseudo-regret bounds when no arm exceeds the threshold. Further, we establish a near-matching lower bound in the small excess-gap regime, showing that any algorithm incurs at least $\Omega\left(\sum_{i: \Delta_i > \Delta_*} \frac{\log\left(\left(\sum_{i:\Delta_i > \Delta_*} \Delta_*/\Delta_i\right)^{-1}\right)}{\Delta_i}\right)$ satisficing pseudo-regret.
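To make the two bound expressions concrete, the following minimal Python sketch (illustrative, not code from the paper) evaluates them on a toy instance, assuming the standard gap definitions $\Delta_i = \mu_* - \mu_i$ and $\Delta_* = \mu_* - S$; the function name `bound_terms` and the example instance are hypothetical.

```python
# Illustrative sketch (assumed definitions, not the paper's code):
# evaluate the stated upper- and lower-bound expressions on a toy instance,
# with Delta_i = mu_* - mu_i and Delta_* = mu_* - S.
import math

def bound_terms(means, S):
    """Return (Delta_*, upper-bound value, lower-bound value) for a toy instance."""
    K = len(means)
    mu_star = max(means)
    delta_star = mu_star - S                      # excess gap: best arm vs. threshold S
    gaps = [mu_star - mu for mu in means]         # suboptimality gaps Delta_i
    large = [d for d in gaps if d > delta_star]   # arms with Delta_i > Delta_*

    # Upper bound: sum over those arms of log(K / Delta_*) / Delta_i
    upper = sum(math.log(K / delta_star) / d for d in large)

    # Lower bound: sum over those arms of log((sum_i Delta_*/Delta_i)^(-1)) / Delta_i
    inner = sum(delta_star / d for d in large)
    lower = sum(math.log(1.0 / inner) / d for d in large)
    return delta_star, upper, lower

# Small excess-gap regime: the best arm barely clears the threshold S = 0.5.
print(bound_terms(means=[0.51, 0.3, 0.2, 0.1], S=0.5))
```

On this instance $\Delta_* = 0.01$ while every suboptimal gap exceeds it, so all suboptimal arms contribute to both sums and the lower-bound value stays below the upper-bound value, matching the regime the abstract targets.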