ICML 2021 Spotlight

Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

Zhang Zihan · Yuan Zhou · Xiangyang Ji


Abstract: In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy with sample complexity $\tilde{O}\left(\frac{SA\ln(1/p)}{\epsilon^{2}(1-\gamma)^{5.5}}\right)$ and success probability $1-p$. (In this work, the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S$, $A$, $1/(1-\gamma)$, and $1/\epsilon$.) For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}\left(\frac{SA\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right)$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
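To get a feel for the gap between the two bounds, the leading terms can be evaluated numerically. The following is only an illustrative sketch: the function name `leading_term` and the example values of $S$, $A$, $\gamma$, $\epsilon$, and $p$ are assumptions made here, and the computation drops all constants and the poly-logarithmic factors hidden by $\tilde{O}(\cdot)$, so the outputs are not actual sample counts. The second bound is also stated only for small enough $\epsilon$, a condition this sketch does not check.

```python
import math


def leading_term(S, A, gamma, eps, p, horizon_exponent):
    """Evaluate S*A*ln(1/p) / (eps^2 * (1-gamma)^horizon_exponent).

    Hypothetical helper: constants and poly-log factors hidden by
    the O-tilde notation are ignored, so this is a scaling guide only.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    return S * A * math.log(1.0 / p) / (eps ** 2 * (1.0 - gamma) ** horizon_exponent)


# Assumed example values, chosen only for illustration.
S, A, gamma, eps, p = 10, 5, 0.9, 0.1, 0.05

first_bound = leading_term(S, A, gamma, eps, p, horizon_exponent=5.5)
second_bound = leading_term(S, A, gamma, eps, p, horizon_exponent=3.0)

# Everything except the power of (1 - gamma) cancels in the ratio,
# leaving (1 - gamma)^(-2.5).
ratio = first_bound / second_bound
print(f"first / second = {ratio:.2f}")
```

The ratio depends only on the discount factor: it equals $(1-\gamma)^{-2.5}$, which is $10^{2.5} \approx 316$ for $\gamma = 0.9$ and grows quickly as $\gamma$ approaches 1.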
