
Workshop on Reinforcement Learning Theory

Estimating Optimal Policy Value in Linear Contextual Bandits beyond Gaussianity

Jonathan Lee · Weihao Kong · Aldo Pacchiano · Vidya Muthukumar · Emma Brunskill

Abstract: While much of bandit learning focuses on minimizing regret while learning an optimal policy, it is often of interest to estimate the maximum achievable value before the optimal policy has been learned, which can serve as an input to downstream tasks such as model selection. Prior work in contextual bandits has considered this in the Gaussian setting. Here, we study the problem of approximating the optimal policy value in the more general linear contextual bandit problem, and we focus on whether it is possible to do so with less data than is needed to learn the optimal policy. We consider two objectives: (1) estimating upper bounds on the value and (2) estimating the value directly. For the first, we present an adaptive upper bound that is at most a logarithmic factor larger than the value, and is tight when the data is Gaussian; we show that it is possible to estimate this upper bound with $\widetilde{\mathcal{O}}( \sqrt{d} )$ samples, where $d$ is the number of parameters. As a consequence of this bound, we show improved regret bounds for model selection. For the second objective, we present a moment-based algorithm for estimating the optimal policy value with sample complexity $\widetilde{ \mathcal{O}}( \sqrt{d} )$ for sub-Gaussian context distributions whose low-order moments are known.
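For concreteness, the quantity being estimated is the optimal policy value $V^* = \mathbb{E}_x[\max_a \langle \theta_a, x \rangle]$ in a linear contextual bandit. The sketch below is only an illustration with made-up dimensions and parameters, not the paper's moment-based estimator: it approximates $V^*$ by plain Monte Carlo over Gaussian contexts, assuming the arm parameters $\theta_a$ are known.

```python
import numpy as np

# Illustrative sketch (NOT the paper's algorithm): a K-armed linear
# contextual bandit where arm a has parameter theta_a in R^d and the
# mean reward for context x is <theta_a, x>. The optimal policy value
#   V* = E_x[ max_a <theta_a, x> ]
# is approximated here by Monte Carlo, assuming theta is known.
rng = np.random.default_rng(0)
d, K, n = 5, 3, 100_000
theta = rng.normal(size=(K, d))        # hypothetical arm parameters
contexts = rng.normal(size=(n, d))     # Gaussian (hence sub-Gaussian) contexts
rewards = contexts @ theta.T           # (n, K): mean reward of each arm per context
v_star = rewards.max(axis=1).mean()    # Monte Carlo estimate of V*
print(v_star)
```

The point of the paper is precisely that when $\theta$ is unknown, $V^*$ (or an upper bound on it) can be estimated from only $\widetilde{\mathcal{O}}(\sqrt{d})$ samples, far fewer than the $\Omega(d)$ typically needed to learn the optimal policy itself.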
