Processing math: 100%
Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Workshop on Reinforcement Learning Theory

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Jiafan He · Dongruo Zhou · Quanquan Gu


Abstract: We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-γ, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-γ achieves an ˜O(SAT/(1γ)1.5) regret, where S is the number of states, A is the number of actions, γ is the discount factor and T is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least ˜Ω(SAT/(1γ)1.5). Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-γ is nearly minimax optimal for discounted MDPs.

Chat is not available.