Provably Efficient Offline Reinforcement Learning for Partially Observable Markov Decision Processes
Hongyi Guo · Qi Cai · Yufeng Zhang · Zhuoran Yang · Zhaoran Wang

Thu Jul 21 11:40 AM -- 11:45 AM (PDT)

We study offline reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) with possibly infinite state and observation spaces. Under the undercompleteness assumption, the optimal policy in such POMDPs is characterized by a class of finite-memory Bellman operators. In the offline setting, estimating these operators directly is challenging due to (i) the large observation space and (ii) insufficient coverage of the offline dataset. To tackle these challenges, we propose a novel algorithm that constructs confidence regions for these Bellman operators via offline estimation of their RKHS embeddings, and returns the final policy via pessimistic planning within the confidence regions. We prove that the proposed algorithm attains an \(\epsilon\)-optimal policy using an offline dataset containing \(\tilde{\mathcal{O}}(1 / \epsilon^2)\) episodes, provided that the behavior policy has good coverage over the optimal trajectory. To the best of our knowledge, our algorithm is the first provably sample-efficient offline algorithm for POMDPs without uniform coverage assumptions.

Author Information

Hongyi Guo (Northwestern University)
Qi Cai (Northwestern University)
Yufeng Zhang (Northwestern University)
Zhuoran Yang (Yale University)
Zhaoran Wang (Northwestern University)
