Poster
On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel MDPs
Yuanzhou Chen · Jiafan He · Quanquan Gu
Hall E #807
Keywords: [ RL: Function Approximation ] [ RL: Online ] [ RL: Discounted Cost/Reward ]
Abstract:
We study reinforcement learning for infinite-horizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. The existing UCLK algorithm \citep{zhou2020provably} for this setting only has a regret guarantee, which cannot lead to a tight sample complexity bound. In this paper, we extend uniform-PAC sample complexity from the episodic setting to the infinite-horizon discounted setting, and propose a novel algorithm dubbed UPAC-UCLK that achieves an $\tilde{O}\big(d^2/((1-\gamma)^4\epsilon^2) + 1/((1-\gamma)^6\epsilon^2)\big)$ uniform-PAC sample complexity, where $d$ is the dimension of the feature mapping, $\gamma \in (0,1)$ is the discount factor of the MDP, and $\epsilon$ is the accuracy parameter. To the best of our knowledge, this is the first $\tilde{O}(1/\epsilon^2)$ sample complexity bound for learning infinite-horizon discounted MDPs with linear function approximation (without access to a generative model).
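For reference, a minimal LaTeX sketch of the two quantities the abstract refers to: the linear kernel (linear mixture) MDP assumption, in which the transition kernel is linear in a known feature mapping with an unknown parameter, and the claimed uniform-PAC bound. The symbols $\phi$ and $\theta^*$ follow the standard formulation of \citep{zhou2020provably} and are assumptions for illustration, not notation defined on this page.

% Linear kernel MDP assumption (standard formulation; notation assumed):
% the transition probability is an inner product of a known feature map
% phi with an unknown parameter vector theta^* in R^d.
\[
  \mathbb{P}(s' \mid s, a) \;=\; \big\langle \phi(s' \mid s, a),\, \theta^* \big\rangle,
  \qquad \phi : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d .
\]
% Uniform-PAC sample complexity stated in the abstract for UPAC-UCLK:
\[
  \tilde{O}\!\left( \frac{d^2}{(1-\gamma)^4 \epsilon^2} \;+\; \frac{1}{(1-\gamma)^6 \epsilon^2} \right).
\]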