ICML Discuss
On the Sample Complexity of Reinforcement Learning with a Generative Model
by Mohammad Gheshlaghi Azar, Remi Munos, Bert Kappen at ICML 2012
We consider the problem of learning the optimal action-value function in the discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample-complexity of model-based value iteration algorithm in the presence of the generative model, which indicates that for an MDP with $N$ state-action pairs and the discount factor $\gamma\in[0,1)$ only $O\left(N\log(N/\delta)/\left((1-\gamma)^3\epsilon^2\right)\right)$ samples are required to find an $\epsilon$-optimal estimation of the action-value function with the probability $1-\delta$. We also prove a matching lower bound of $\Theta \left(N\log(N/\delta)/\left((1-\gamma)^3\epsilon^2\right)\right)$ on the sample complexity of estimating the optimal action-value function by every RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-)value function in which the upper bound matches the lower bound of RL in terms of $N$, $\epsilon$, $\delta$ and $1-\gamma$. Also, both our lower bound and our upper bound significantly improve on the state-of-the-art in terms of $1/(1-\gamma)$.

Related Material

Download PDF Watch Video

Discussion

Email notifications of comments are sent to authors.
Please use the feedback page to report broken links and other problems.
blog comments powered by Disqus