

Spotlight

Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning

Gen Li · Changxiao Cai · Yuxin Chen · Yuantao Gu · Yuting Wei · Yuejie Chi

Abstract: Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. Focusing on the synchronous setting (such that independent samples for all state-action pairs are queried via a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. To yield an entrywise ε-accurate estimate of the optimal Q-function, state-of-the-art theory requires at least an order of |S||A| / ((1−γ)^5 ε^2) samples in the infinite-horizon γ-discounted setting. In this work, we sharpen the sample complexity of synchronous Q-learning to the order of |S||A| / ((1−γ)^4 ε^2) (up to some logarithmic factor) for any 0 < ε < 1, leading to an order-wise improvement in 1/(1−γ). Analogous results are derived for finite-horizon MDPs as well. Notably, our sample complexity analysis unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage. Our result is obtained by identifying novel error decompositions and recursion relations, which might shed light on how to study other variants of Q-learning.
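
To make the setting concrete, here is a minimal sketch of vanilla synchronous Q-learning with a generative model: in every iteration, one independent next-state sample is drawn for each (s, a) pair and every entry of Q is updated at once. The helper names (sample_next_state, reward) and the specific learning-rate schedule are illustrative assumptions, not the exact choices analyzed in the paper.

```python
import numpy as np

def synchronous_q_learning(sample_next_state, reward, num_states, num_actions,
                           gamma=0.9, num_iters=10_000):
    """Sketch of vanilla synchronous Q-learning with a generative model.

    sample_next_state(s, a) -- hypothetical sampler returning one draw from P(.|s, a)
    reward                  -- array of shape (num_states, num_actions)
    """
    Q = np.zeros((num_states, num_actions))
    for t in range(1, num_iters + 1):
        # Rescaled linear step size; the paper studies particular step-size
        # schedules, and this choice is for illustration only.
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)
        Q_new = np.empty_like(Q)
        for s in range(num_states):
            for a in range(num_actions):
                s_next = sample_next_state(s, a)  # one generative-model query
                target = reward[s, a] + gamma * Q[s_next].max()
                Q_new[s, a] = (1.0 - eta) * Q[s, a] + eta * target
        Q = Q_new
    return Q
```

Each iteration consumes |S||A| samples, so a bound on the number of iterations needed for an entrywise ε-accurate Q-estimate translates directly into the sample complexities discussed in the abstract.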
