Provably Efficient Exploration in Policy Optimization
Qi Cai · Zhuoran Yang · Chi Jin · Zhaoran Wang

Tue Jul 14 07:00 AM -- 07:45 AM & Tue Jul 14 06:00 PM -- 06:45 PM (PDT)
While policy-based reinforcement learning (RL) achieves tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, for episodic Markov decision processes with unknown transitions and full-information feedback of adversarial rewards, OPPO achieves an $\tilde{O}(\sqrt{|\mathcal{S}|^2|\mathcal{A}|H^3 T})$ regret. Here $|\mathcal{S}|$ is the size of the state space, $|\mathcal{A}|$ is the size of the action space, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
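To get a feel for how the stated regret bound scales, the sketch below evaluates only the leading term under the square root. It is an illustration, not part of the paper: the function names are hypothetical, and the constants and logarithmic factors hidden by the $\tilde{O}$ notation are ignored, so the values are meaningful only as relative scales.

```python
import math


def regret_bound_scale(num_states, num_actions, horizon, total_steps):
    """Leading term sqrt(|S|^2 * |A| * H^3 * T) of the stated bound.

    Constants and log factors hidden by the tilde-O notation are dropped
    (an assumption of this sketch), so only ratios are informative.
    """
    return math.sqrt(num_states**2 * num_actions * horizon**3 * total_steps)


def per_step_scale(num_states, num_actions, horizon, total_steps):
    """Leading term divided by T: the average regret scale per step."""
    return regret_bound_scale(num_states, num_actions, horizon, total_steps) / total_steps


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    for total_steps in (10**4, 10**6, 10**8):
        total = regret_bound_scale(10, 5, 20, total_steps)
        average = per_step_scale(10, 5, 20, total_steps)
        print(f"T={total_steps:>9}  bound scale={total:.3e}  per step={average:.3e}")
```

Because the leading term grows like $\sqrt{T}$, quadrupling the number of steps only doubles it, and the per-step scale shrinks like $1/\sqrt{T}$. This is the sense in which a sublinear regret bound implies that the average performance gap to the best fixed policy vanishes as $T$ grows.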

Author Information

Qi Cai (Northwestern University)
Zhuoran Yang (Princeton University)
Chi Jin (Princeton University)
Zhaoran Wang (Northwestern University)

More from the Same Authors