

Poster in Workshop: Workshop on Reinforcement Learning Theory

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Ming Yin · Yu Bai · Yu-Xiang Wang


Abstract: We consider the problem of offline reinforcement learning (RL), a well-motivated setting of RL that aims at policy optimization using only historical data. In this paper, we propose \emph{Off-Policy Double Variance Reduction} (OPDVR), a new variance-reduction-based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finite-horizon \emph{stationary transition} setting, improving over the best known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $\Omega(H^2/d_m\epsilon^2)$, which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings, such as finite-horizon MDPs with non-stationary transitions and infinite-horizon MDPs with discounted rewards.
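
The abstract does not spell out the algorithm itself; as context only, the sketch below illustrates the generic (SVRG-style) variance-reduction template for a tabular Bellman backup that methods in this family build on. It is a minimal sketch under assumed names and data layout, not the authors' OPDVR, whose exact update rules, data handling, and second ("double") variance-reduction stage are given in the paper.

# Minimal illustrative sketch of a variance-reduced Bellman backup
# (tabular, finite-horizon, stationary transitions). NOT the authors'
# OPDVR; all names here are illustrative assumptions.

import numpy as np


def empirical_backup(batch, V, num_states, num_actions):
    """Plain empirical Bellman backup: estimate E[ r + V(s') | s, a ]."""
    total = np.zeros((num_states, num_actions))
    count = np.zeros((num_states, num_actions))
    for s, a, r, s_next in batch:
        total[s, a] += r + V[s_next]
        count[s, a] += 1
    return total / np.maximum(count, 1), count


def vr_bellman_backup(ref_batch, small_batch, V_ref, V_cur,
                      num_states, num_actions):
    """Variance-reduced backup:
        Q(s,a) ~= P_ref[ r + V_ref ](s,a) + P_small[ V_cur - V_ref ](s,a).
    The reference term reuses a large batch (low variance, computed once);
    the correction term uses a fresh small batch whose variance scales with
    ||V_cur - V_ref||_inf, which shrinks as the estimates improve.
    """
    ref_term, _ = empirical_backup(ref_batch, V_ref, num_states, num_actions)

    corr = np.zeros((num_states, num_actions))
    count = np.zeros((num_states, num_actions))
    for s, a, _r, s_next in small_batch:
        corr[s, a] += V_cur[s_next] - V_ref[s_next]
        count[s, a] += 1
    corr = corr / np.maximum(count, 1)

    return ref_term + corr


if __name__ == "__main__":
    # Toy offline dataset of (state, action, reward, next_state) tuples.
    rng = np.random.default_rng(0)
    S, A = 3, 2
    data = [(int(rng.integers(S)), int(rng.integers(A)),
             float(rng.random()), int(rng.integers(S)))
            for _ in range(2000)]
    ref_batch, small_batch = data[:1500], data[1500:]

    V_ref = np.zeros(S)       # reference value estimate (e.g. from a prior stage)
    V_cur = rng.random(S)     # current estimate of the next-step value
    print(vr_bellman_backup(ref_batch, small_batch, V_ref, V_cur, S, A))

The intuition the sketch conveys is the one stated in the abstract: reusing a low-variance reference estimate and only re-estimating the (small) correction is what lets variance-reduction-based methods shave a factor of the horizon $H$ from the sample complexity.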
