Poster in Workshop: Decision Awareness in Reinforcement Learning
Exploration Hurts in Bandits with Partially Observed Stochastic Contexts
Hongju Park · Mohamad Kazem Shirani Faradonbeh
Contextual bandits are widely used models in reinforcement learning for incorporating both common and idiosyncratic factors in reward functions. Existing approaches rely on full observation of the stochastic context vectors, while the problem of learning optimal arms from partially observed contexts remains largely unexplored. We show theoretically that in the latter setting, decisions can be made more cautiously to minimize the risk of pulling sub-optimal arms. More precisely, efficiency is established for Greedy policies that treat the estimates of the unknown parameter and of the unobserved contexts as if they were the true values. The results include non-asymptotic worst-case regret bounds that grow (poly-)logarithmically with the time horizon and with the reciprocal of the failure probability, and linearly with the number of arms. Numerical results showcasing the efficacy of avoiding exploration are also provided.
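To make the Greedy plug-in idea concrete, the following is a minimal, self-contained sketch, not the authors' exact algorithm or model. It assumes, purely for illustration, Gaussian contexts observed through additive Gaussian noise, contexts estimated by posterior-mean shrinkage, and the reward parameter estimated by ridge regression; the policy then pulls the arm maximizing the plug-in reward estimate, with no exploration bonus.

```python
# Illustrative sketch of a Greedy plug-in policy for a linear contextual
# bandit with partially observed stochastic contexts. All modeling choices
# (Gaussian contexts, identity observation channel, ridge estimation) are
# assumptions made for this example, not the paper's specification.
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, horizon = 5, 4, 2000
mu_true = rng.normal(size=d)                # unknown reward parameter
sigma_x, sigma_obs, sigma_r = 1.0, 0.5, 0.1

# Posterior-mean shrinkage factor for an unobserved context given its noisy
# observation, under x ~ N(0, sigma_x^2 I) and y = x + N(0, sigma_obs^2 I).
shrink = sigma_x**2 / (sigma_x**2 + sigma_obs**2)

XtX = np.eye(d)                             # ridge regularizer
Xty = np.zeros(d)
regret = 0.0

for t in range(horizon):
    x = rng.normal(scale=sigma_x, size=(n_arms, d))         # latent contexts
    y = x + rng.normal(scale=sigma_obs, size=(n_arms, d))   # partial observations
    x_hat = shrink * y                                       # plug-in context estimates

    mu_hat = np.linalg.solve(XtX, Xty)      # plug-in parameter estimate
    arm = int(np.argmax(x_hat @ mu_hat))    # Greedy: treat estimates as true values

    reward = x[arm] @ mu_true + rng.normal(scale=sigma_r)
    regret += np.max(x @ mu_true) - x[arm] @ mu_true

    # Update the ridge regression using the estimated (not the true) context.
    XtX += np.outer(x_hat[arm], x_hat[arm])
    Xty += x_hat[arm] * reward

print(f"cumulative regret after {horizon} rounds: {regret:.2f}")
```

Running the sketch and comparing against an exploration-based variant (e.g., adding an optimistic bonus to the arm scores) is one simple way to reproduce, qualitatively, the paper's message that exploration can be avoided in this setting.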