Workshop: Workshop on Reinforcement Learning Theory

Randomized Least Squares Policy Optimization

Haque Ishfaq · Zhuoran Yang · Andrei Lupu · Viet Nguyen · Lewis Liu · Riashat Islam · Zhaoran Wang · Doina Precup

Abstract: Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. However, designing provably efficient policy optimization algorithms remains a challenge. Recent work in this area has focused on incorporating upper confidence bound (UCB)-style bonuses to drive exploration in policy optimization. In this paper, we present Randomized Least Squares Policy Optimization (RLSPO), which is inspired by Thompson Sampling. We prove that, in an episodic linear kernel MDP setting, RLSPO achieves $\widetilde{\mathcal{O}}(d^{3/2} H^{3/2} \sqrt{T})$ worst-case (frequentist) regret, where $H$ is the length of each episode, $T$ is the total number of steps and $d$ is the feature dimension. Finally, we evaluate RLSPO empirically and show that it is competitive with existing provably efficient PO algorithms.
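
To illustrate the general idea of Thompson-Sampling-style exploration through randomized least squares, here is a minimal Python sketch. It is not the RLSPO algorithm from the paper: the function name `randomized_least_squares`, the parameters `noise_std` and `reg`, and the Gaussian perturbation scheme are all assumptions chosen for illustration. The sketch fits a ridge regression on noise-perturbed targets, so repeated calls return different plausible weight vectors instead of one point estimate plus a UCB bonus.

```python
import numpy as np


def randomized_least_squares(features, targets, noise_std=1.0, reg=1.0, rng=None):
    """Return one randomized ridge-regression solution (illustrative only).

    features: array of shape (n, d); targets: array of shape (n,).
    noise_std and reg are hypothetical tuning parameters, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = features.shape
    # Perturb each regression target with independent Gaussian noise.
    noisy_targets = targets + noise_std * rng.standard_normal(n)
    # Perturb the regularizer too, so directions with little data stay random.
    prior_noise = noise_std * np.sqrt(reg) * rng.standard_normal(d)
    # Solve the perturbed regularized normal equations.
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ noisy_targets + prior_noise)
```

Under these assumptions, each returned vector is distributed around the ordinary ridge solution with covariance `noise_std**2` times the inverse regularized Gram matrix, so the spread shrinks in feature directions that have been observed often. Setting `noise_std=0.0` recovers plain ridge regression.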