Timezone: »

Randomized Least Squares Policy Optimization
Haque Ishfaq · Zhuoran Yang · Andrei Lupu · Viet Nguyen · Lewis Liu · Riashat Islam · Zhaoran Wang · Doina Precup
Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. However, designing provably efficient policy optimization algorithms remains a challenge. Recent work in this area has focused on incorporating upper confidence bound (UCB)-style bonuses to drive exploration in policy optimization. In this paper, we present Randomized Least Squares Policy Optimization (RLSPO) which is inspired by Thompson Sampling. We prove that, in an episodic linear kernel MDP setting, RLSPO achieves $\Tilde{\mathcal{O}}(d^{3/2} H^{3/2} \sqrt{T})$ worst-case (frequentist) regret, where $H$ is the number of episodes, $T$ is the total number of steps and $d$ is the feature dimension. Finally, we evaluate RLSPO empirically and show that it is competitive with existing provably efficient PO algorithms.

#### Author Information

##### Haque Ishfaq (MILA / McGill University)

I am a first-year Ph.D. student in the Montreal Institute for Learning Algorithms (MILA) at McGill University, where I am advised by Professor Doina Precup. Before coming to McGill, I obtained my M.S. degree in Statistics and B.S. degree in Mathematical and Computational Science from Stanford University.