A Direct Approach for Handling Contextual Bandits with Latent State Dynamics
Abstract
We revisit the finite-armed contextual bandit model of Nelson et al. [2022], in which contexts and rewards are governed by a finite hidden Markov chain. Nelson et al. [2022] approach this model via a reduction to linear bandits, but their reduction relies on a simplification in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts, rather than functions of the hidden states themselves, and it assumes knowledge of the HMM parameters. We instead study the more natural model in which rewards depend directly on the hidden states (in addition to depending on the observed contexts, as is natural for contextual bandits), and we target stronger, high-probability regret bounds for a fully adaptive strategy that estimates the HMM parameters online.
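As an illustrative sketch of the distinction (the notation $s_t$ for the hidden state, $x_t$ for the observed context, $a_t$ for the chosen arm, $b_t$ for the posterior belief, and $\theta_a$, $\mu_a$ for reward parameters is introduced here for exposition only and is not taken from Nelson et al. [2022]): the simplified model takes expected rewards to be linear in the belief state, whereas the model studied here lets rewards depend on the hidden state itself,
\[
\text{(belief-linear)}\quad \mathbb{E}\!\left[r_t \mid x_{1:t}, a_t\right] = \langle \theta_{a_t},\, b_t \rangle, \qquad b_t(k) = \mathbb{P}\!\left(s_t = k \mid x_{1:t}\right),
\]
\[
\text{(direct)}\quad \mathbb{E}\!\left[r_t \mid s_t, x_t, a_t\right] = \mu_{a_t}(s_t, x_t).
\]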