On the Sample Complexity of Reinforcement Learning with a Generative Model
We consider the problem of learning the optimal action-value function in the discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample-complexity of model-based value iteration algorithm in the presence of the generative model, which indicates that for an MDP with $N$ state-action pairs and the discount factor $\gamma\in[0,1)$ only $O\left(N\log(N/\delta)/\left((1-\gamma)^3\epsilon^2\right)\right)$ samples are required to find an $\epsilon$-optimal estimation of the action-value function with the probability $1-\delta$. We also prove a matching lower bound of $\Theta \left(N\log(N/\delta)/\left((1-\gamma)^3\epsilon^2\right)\right)$ on the sample complexity of estimating the optimal action-value function by every RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-)value function in which the upper bound matches the lower bound of RL in terms of $N$, $\epsilon$, $\delta$ and $1-\gamma$. Also, both our lower bound and our upper bound significantly improve on the state-of-the-art in terms of $1/(1-\gamma)$.

## Discussion

Email notifications of comments are sent to authors.
Please use the feedback page to report broken links and other problems.
blog comments powered by Disqus