## Provably Efficient Exploration in Policy Optimization

### Qi Cai · Zhuoran Yang · Chi Jin · Zhaoran Wang

Keywords: Non-convex Optimization · Reinforcement Learning Theory

Tue 14 Jul 7 a.m. PDT — 7:45 a.m. PDT
Tue 14 Jul 6 p.m. PDT — 6:45 p.m. PDT

Abstract: While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision processes with unknown transition and full-information feedback of adversarial reward, OPPO achieves an $\tilde{O}(\sqrt{|\mathcal{S}|^2|\mathcal{A}|H^3 T})$ regret. Here $|\mathcal{S}|$ is the size of the state space, $|\mathcal{A}|$ is the size of the action space, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
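To make the "optimistic policy gradient" idea concrete, the sketch below illustrates the two ingredients the abstract alludes to in a toy tabular setting: an optimistic Q-estimate (here a generic count-based bonus, which is an illustrative assumption, not the paper's exact construction) and a proximal, mirror-descent-style policy update of the form $\pi_{k+1}(a\mid s) \propto \pi_k(a\mid s)\exp(\alpha\, Q_k(s,a))$. All variable names and the bonus form are hypothetical; this is a minimal sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
alpha = 0.5  # mirror-descent step size

# Uniform initial policy and visit counts (initialized to 1 to avoid division by zero).
pi = np.full((n_states, n_actions), 1.0 / n_actions)
counts = np.ones((n_states, n_actions))

# Stand-in Q estimates; in the paper these come from optimistic policy evaluation.
q_est = rng.uniform(size=(n_states, n_actions))
bonus = 1.0 / np.sqrt(counts)   # optimism: larger bonus for rarely tried (s, a) pairs
q_opt = q_est + bonus           # optimistic Q used for the improvement step

# Proximal (KL-regularized / multiplicative-weights) policy update:
# pi_new(a|s) ∝ pi(a|s) * exp(alpha * q_opt(s, a)), normalized per state.
logits = np.log(pi) + alpha * q_opt
pi_new = np.exp(logits - logits.max(axis=1, keepdims=True))
pi_new /= pi_new.sum(axis=1, keepdims=True)

print(np.allclose(pi_new.sum(axis=1), 1.0))  # each row is a valid distribution
```

The exponentiated update keeps the new policy close (in KL divergence) to the old one, which is the "proximal" aspect, while the bonus term steers probability mass toward under-explored actions.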