Poster
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity
Zhang Zihan · Yuan Zhou · Xiangyang Ji

Tue Jul 20 09:00 AM -- 11:00 AM (PDT) @ Virtual
In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm that learns an $\epsilon$-optimal policy with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}})$ and success probability $(1-p)$, where the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S$, $A$, $1/(1-\gamma)$, and $1/\epsilon$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}})$. While the first bound improves upon all known model-free algorithms, as well as model-based algorithms with tight dependence on $S$, the second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
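To get a feel for the magnitude of the two bounds, the following minimal Python sketch evaluates both expressions numerically while dropping the poly-logarithmic factors hidden by $\tilde{O}(\cdot)$. The function names and the example values of $S$, $A$, $\gamma$, $\epsilon$, $p$ are illustrative assumptions, not taken from the paper.

# Minimal sketch (not from the paper): compare the two sample-complexity
# bounds quoted in the abstract, ignoring the poly-logarithmic factors
# hidden by the tilde-O notation. All names and values are illustrative.
import math

def bound_first(S, A, gamma, eps, p):
    # ~O(S*A*ln(1/p) / (eps^2 * (1-gamma)^5.5)), log factors dropped
    return S * A * math.log(1 / p) / (eps**2 * (1 - gamma)**5.5)

def bound_second(S, A, gamma, eps, p):
    # ~O(S*A*ln(1/p) / (eps^2 * (1-gamma)^3)), stated for small enough eps
    return S * A * math.log(1 / p) / (eps**2 * (1 - gamma)**3)

if __name__ == "__main__":
    S, A, gamma, eps, p = 100, 10, 0.99, 0.1, 0.05
    print(f"first bound : {bound_first(S, A, gamma, eps, p):.3e} samples")
    print(f"second bound: {bound_second(S, A, gamma, eps, p):.3e} samples")
    # The gap is a factor of 1/(1-gamma)^{2.5}; e.g. 100^2.5 = 1e5 when gamma = 0.99.

Under these assumed values the second bound is smaller by a factor of $(1-\gamma)^{-2.5}$, which is what "matches the information-theoretic lower bound up to logarithmic factors" buys over the first bound.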

Author Information

Zhang Zihan (Tsinghua University)
Yuan Zhou (UIUC)
Xiangyang Ji (Tsinghua University)
