ICML 2021 Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning Spotlight

Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning

Gen Li · Changxiao Cai · Yuxin Chen · Yuantao Gu · Yuting Wei · Yuejie Chi

Abstract: Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. Focusing on the synchronous setting (such that independent samples for all state-action pairs are queried via a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. To yield an entrywise $\varepsilon$-accurate estimate of the optimal Q-function, state-of-the-art theory requires at least an order of $\frac{|S||A|}{(1-\gamma)^5\varepsilon^2}$ samples in the infinite-horizon $\gamma$-discounted setting. In this work, we sharpen the sample complexity of synchronous Q-learning to the order of $\frac{|S||A|}{(1-\gamma)^4\varepsilon^2}$ (up to some logarithmic factor) for any $0<\varepsilon<1$, leading to an order-wise improvement in $\frac{1}{1-\gamma}$. Analogous results are derived for finite-horizon MDPs as well. Notably, our sample complexity analysis unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage. Our result is obtained by identifying novel error decompositions and recursion relations, which might shed light on how to study other variants of Q-learning.
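To make the synchronous setting concrete, the following is a minimal sketch of vanilla synchronous Q-learning on a small random MDP, where every iteration draws one independent next-state sample for each state-action pair from a simulated generative model. It is an illustration only, not the authors' code: the function names, the toy MDP, the iteration count, and the step-size schedule $\eta_t = 1/(1+(1-\gamma)t)$ are all assumptions made for this example.

```python
import numpy as np


def synchronous_q_learning(P, r, gamma, num_iters, rng):
    """Vanilla synchronous Q-learning (illustrative sketch).

    P: array of shape (S, A, S) with transition probabilities.
    r: array of shape (S, A) with rewards in [0, 1].
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    cdf = np.cumsum(P, axis=2)
    for t in range(1, num_iters + 1):
        # Assumed step-size schedule for this sketch (rescaled linear decay).
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)
        # Generative model: one independent next state per (s, a) pair,
        # drawn by inverse-CDF sampling.
        u = rng.random((S, A, 1))
        next_states = np.minimum((u > cdf).sum(axis=2), S - 1)
        # Empirical Bellman update applied to all entries at once.
        V = Q.max(axis=1)
        Q = (1.0 - eta) * Q + eta * (r + gamma * V[next_states])
    return Q


def optimal_q(P, r, gamma, tol=1e-10):
    """Reference Q* via value iteration with the true model (for comparison only)."""
    Q = np.zeros_like(r)
    while True:
        Q_new = r + gamma * (P @ Q.max(axis=1))
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new


# Toy problem (hypothetical sizes and seed, chosen only for the demo).
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
r = rng.random((S, A))

Q_hat = synchronous_q_learning(P, r, gamma, num_iters=20000, rng=rng)
err = np.abs(Q_hat - optimal_q(P, r, gamma)).max()  # entrywise error
print("entrywise error:", err)
```

Each iteration consumes $|S||A|$ samples, so the total sample count is the number of iterations times $|S||A|$; the entrywise error printed at the end is the quantity that the abstract's $\varepsilon$-accuracy guarantee refers to.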
