Poster
Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems
Marc Abeille · Alessandro Lazaric

Wed Jul 11 09:15 AM -- 12:00 PM (PDT) @ Hall B #198
Thompson sampling (\ts) is an effective approach to trade off exploration and exploitation in reinforcement learning. Despite its empirical success and recent advances, its theoretical analysis is often limited to the Bayesian setting, finite state-action spaces, or finite-horizon problems. In this paper, we study an instance of \ts in the challenging setting of infinite-horizon linear quadratic (LQ) control, which models problems with continuous state-action variables, linear dynamics, and quadratic cost. In particular, we analyze the regret in the frequentist sense (i.e., for a fixed unknown environment) in one-dimensional systems. We derive the first $O(\sqrt{T})$ frequentist regret bound for this problem, thus significantly improving the $O(T^{2/3})$ bound of~\citet{abeille2017thompson} and matching the frequentist performance derived by~\citet{abbasi2011regret} for an optimistic approach and the Bayesian result of~\citet{ouyang2017learning-based}. We obtain this result by developing a novel bound on the regret due to policy switches, which holds for LQ systems of any dimensionality and allows updating the parameters and the policy at each step, thus overcoming previous limitations due to lazy updates. Finally, we report numerical simulations supporting the conjecture that our result extends to multi-dimensional systems.
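To make the setting concrete, here is a minimal, hypothetical sketch of Thompson sampling for a one-dimensional LQ system $x_{t+1} = a x_t + b u_t + w_t$ with cost $q x_t^2 + r u_t^2$: a Gaussian posterior over $(a, b)$ is maintained by regularized least squares, a parameter vector is sampled from it at every step, and the corresponding optimal linear controller is obtained from the scalar Riccati equation. All names, parameter values, and the safeguard against near-zero sampled $b$ are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def riccati_gain(a, b, q=1.0, r=1.0, iters=200):
    """Solve the scalar discrete-time Riccati equation
    P = q + a^2 P - (a b P)^2 / (r + b^2 P)
    by fixed-point iteration and return the optimal gain K (u = -K x)."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

def thompson_sampling_lq(a_true=0.9, b_true=0.5, T=2000, noise=0.1,
                         lam=1.0, seed=0):
    """Illustrative sketch: TS on x_{t+1} = a x_t + b u_t + w_t,
    resampling the parameters and recomputing the policy at every step
    (no lazy updates). Returns the visited states."""
    rng = np.random.default_rng(seed)
    V = lam * np.eye(2)      # regularized design matrix of features z = (x, u)
    y = np.zeros(2)          # accumulated z_t * x_{t+1}
    x = 0.0
    states = []
    for _ in range(T):
        mean = np.linalg.solve(V, y)                   # RLS estimate of (a, b)
        cov = noise ** 2 * np.linalg.inv(V)
        a_s, b_s = rng.multivariate_normal(mean, cov)  # posterior sample
        if abs(b_s) < 1e-3:                            # crude guard against
            b_s = 1e-3 if b_s >= 0 else -1e-3          # unstabilizable samples
        u = -riccati_gain(a_s, b_s) * x                # act greedily w.r.t. sample
        x_next = a_true * x + b_true * u + noise * rng.standard_normal()
        z = np.array([x, u])
        V += np.outer(z, z)                            # update the posterior
        y += z * x_next
        x = x_next
        states.append(x)
    return np.array(states)
```

Because the sampled parameters change at every step, so does the controller; bounding the regret contributed by these per-step policy switches is precisely the technical step the abstract highlights.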