ICML 2021 Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping Spotlight

Spotlight

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Dongruo Zhou · Jiafan He · Quanquan Gu

[ Abstract ] [ Visit Reinforcement Learning Theory 2 ] [ Paper ]

[ Paper ]

Abstract: Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of certain feature mapping. We propose a novel algorithm which makes use of the feature mapping and obtains a

\tilde{O} (d \sqrt{T} / (1 - γ)^{2})

$\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where

d

$d$ is the dimension of the feature space,

T

$T$ is the time horizon and

γ

$\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing a generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms, the regret is lower bounded by

Ω (d \sqrt{T} / (1 - γ)^{1.5})

$\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ . Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a

(1 - γ)^{- 0.5}

$(1-\gamma)^{-0.5}$ factor.

Chat is not available.