Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Workshop on Reinforcement Learning Theory

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Ming Yin · Yu Bai · Yu-Xiang Wang


Abstract: We consider the problem of offline reinforcement learning (RL) --- a well-motivated setting of RL that aims at policy optimization using only historical data. In this paper, we propose \emph{Off-Policy Double Variance Reduction} (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an ϵ-optimal policy with O~(H2/dmϵ2) episodes of offline data in the finite-horizon \emph{stationary transition} setting and this improves over the best known upper bound by a factor of H. Moreover, we establish an information-theoretic lower bound of Ω(H2/dmϵ2) which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as the finite-horizon MDPs with non-stationary transitions and the infinite horizon MDPs with discounted rewards.

Chat is not available.