What Does Thompson Sampling Optimize?
Abstract
Thompson Sampling is one of the most widely used and studied bandit algorithms, known for its simple structure, low regret, and solid theoretical guarantees. Yet, in stark contrast to most other families of bandit algorithms, the exact mechanism through which posterior sampling (as introduced by Thompson) is able to "properly" balance exploration and exploitation remains a mystery. In this paper, we show that the key insight for addressing this question stems from recasting Thompson Sampling as an online optimization algorithm. To distill this, we introduce a time-invariant notion of regret that summarizes cumulative regret across horizons (through a regret bound), leading to a time-invariant Bellman-optimal policy. It turns out that Thompson Sampling admits an online optimization form that mimics the structure of the Bellman-optimal policy, in which greediness is regularized by a measure of residual uncertainty. When viewed through this new lens of online optimization, Thompson Sampling can be understood and improved in a principled manner by comparing it against the Bellman-optimal benchmark.
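For concreteness, the following is a minimal sketch of standard Beta-Bernoulli Thompson Sampling, the posterior-sampling rule whose exploration-exploitation behavior the paper analyzes; the function name, bandit instance, and horizon below are illustrative choices and not taken from this work.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling on a K-armed bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    # Beta(1, 1) priors: alpha tracks successes + 1, beta tracks failures + 1.
    alpha = np.ones(k)
    beta = np.ones(k)
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Sample one mean per arm from its posterior, then act greedily on the sample.
        theta = rng.beta(alpha, beta)
        arm = int(np.argmax(theta))
        reward = rng.binomial(1, true_means[arm])
        # Conjugate posterior update for the pulled arm.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        regret += best - true_means[arm]
    return regret

# Example: cumulative regret over 10,000 rounds on a hypothetical 3-armed instance.
print(thompson_sampling_bernoulli([0.3, 0.5, 0.6], horizon=10_000))
```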