Near-Optimal Offline Reinforcement Learning via Double Variance Reduction
Ming Yin · Yu Bai · Yu-Xiang Wang
We consider the problem of offline reinforcement learning (RL), a well-motivated setting of RL that aims at policy optimization using only historical data. In this paper, we propose \emph{Off-Policy Double Variance Reduction} (OPDVR), a new variance-reduction-based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finite-horizon \emph{stationary transition} setting, improving over the best known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $\Omega(H^2/d_m\epsilon^2)$, which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as finite-horizon MDPs with non-stationary transitions and infinite-horizon MDPs with discounted rewards.
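The abstract does not spell out the variance-reduction mechanism itself, so the sketch below is only a minimal, hedged illustration of the generic idea that OPDVR-style methods build on: fix a reference value function and estimate a Bellman backup as an accurately computed reference term plus a low-variance correction. Everything here (the function name `backup_vr`, the batch sizes, and the toy tabular MDP) is an assumption for illustration and is not taken from the paper.

```python
# Illustrative sketch (not OPDVR): a Bellman backup E_{s'}[V(s')] is estimated as
# E_{s'}[V_ref(s')] from a large reference batch, plus a correction
# E_{s'}[V(s') - V_ref(s')] from a small fresh batch. When V is close to V_ref,
# the correction term has low variance, which is the generic variance-reduction effect.

import numpy as np

rng = np.random.default_rng(0)

def sample_next_states(P, s, a, n):
    """Draw n next states from the transition row P[s, a] (simulating offline data)."""
    return rng.choice(P.shape[-1], size=n, p=P[s, a])

def backup_vr(P, s, a, V, V_ref, n_ref=2000, n_small=50):
    """Variance-reduced Monte Carlo estimate of E_{s'~P(.|s,a)}[V(s')]."""
    ref_samples = sample_next_states(P, s, a, n_ref)      # large reference batch
    small_samples = sample_next_states(P, s, a, n_small)  # small fresh batch
    ref_term = V_ref[ref_samples].mean()                   # accurate E[V_ref(s')]
    correction = (V[small_samples] - V_ref[small_samples]).mean()  # low-variance difference
    return ref_term + correction

# Toy 5-state, 2-action tabular MDP: compare plain vs. variance-reduced estimates.
S, A = 5, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel, shape (S, A, S)
V_ref = rng.uniform(0.0, 1.0, size=S)        # reference value function
V = V_ref + 0.05 * rng.standard_normal(S)    # current V, close to V_ref

s, a = 0, 1
exact = P[s, a] @ V
plain = [V[sample_next_states(P, s, a, 50)].mean() for _ in range(200)]
vr = [backup_vr(P, s, a, V, V_ref) for _ in range(200)]
print(f"exact backup            : {exact:.4f}")
print(f"plain MC std            : {np.std(plain):.4f}")
print(f"variance-reduced std    : {np.std(vr):.4f}")
```

Running the script shows that, with the same number of fresh samples per backup, the variance-reduced estimator has a markedly smaller standard deviation than the plain Monte Carlo estimate, which is the qualitative effect that variance-reduction arguments exploit.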

Author Information

Ming Yin (UC Santa Barbara)
Yu Bai (Salesforce Research)
Yu-Xiang Wang (UC Santa Barbara)

Yu-Xiang Wang is the Eugene Aas Assistant Professor of Computer Science at UCSB. He runs the Statistical Machine Learning lab and co-founded the UCSB Center for Responsible Machine Learning. He also holds a visiting position at Amazon Web Services. Yu-Xiang's research interests include statistical theory and methodology, differential privacy, reinforcement learning, online learning, and deep learning.
