

Poster

Model-based Reinforcement Learning for Confounded POMDPs

Mao Hong · Zhengling Qi · Yanxun Xu

Hall C 4-9 #1215
Wed 24 Jul 4:30 a.m. PDT — 6 a.m. PDT

Abstract:

We propose a model-based offline reinforcement learning (RL) algorithm for confounded partially observable Markov decision processes (POMDPs) under general function approximations and show that it is provably efficient under technical conditions such as a partial coverage assumption on the offline data distribution. Specifically, we first establish a novel model-based identification result for learning the effect of any action on the reward and future transitions in the confounded POMDP. Using this identification result, we then design a nonparametric two-stage estimation procedure to construct an estimator for off-policy evaluation (OPE), which permits general function approximations. Finally, we learn the optimal policy by performing conservative policy optimization within confidence regions based on the proposed OPE estimation procedure. Under mild conditions, we establish a finite-sample upper bound on the suboptimality of the learned policy relative to the optimal one, which depends polynomially on the sample size and the horizon length.
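The conservative policy optimization step in the abstract can be read as a max-min selection: among candidate policies, choose the one whose worst-case OPE value over the confidence region is largest. The Python sketch below illustrates only that general principle under stated assumptions; the names conservative_policy_optimization, candidate_policies, confidence_set, and estimate_value are hypothetical and this is not the paper's actual procedure.

from typing import Callable, Iterable, TypeVar

Policy = TypeVar("Policy")
Model = TypeVar("Model")


def conservative_policy_optimization(
    candidate_policies: Iterable[Policy],
    confidence_set: Iterable[Model],
    estimate_value: Callable[[Policy, Model], float],
) -> Policy:
    """Pick the policy maximizing its worst-case value over a confidence set.

    Here `confidence_set` stands in for the models consistent with the offline
    data, and `estimate_value` plays the role of an OPE estimator evaluated
    under a particular candidate model (both are illustrative placeholders).
    """
    policies = list(candidate_policies)
    models = list(confidence_set)

    def pessimistic_value(policy: Policy) -> float:
        # Conservative (pessimistic) evaluation: worst case over the region.
        return min(estimate_value(policy, model) for model in models)

    return max(policies, key=pessimistic_value)


if __name__ == "__main__":
    # Toy usage: two candidate policies, three candidate models in the region.
    policies = ["stay", "switch"]
    region = [{"stay": 1.0, "switch": 0.5},
              {"stay": 0.8, "switch": 1.2},
              {"stay": 0.9, "switch": 0.1}]
    best = conservative_policy_optimization(
        policies, region, lambda pi, m: m[pi])
    print(best)  # "stay": worst-case value 0.8 vs. 0.1 for "switch"

In this toy run, "stay" is selected because its pessimistic value over the region (0.8) exceeds that of "switch" (0.1), mirroring how a conservative criterion guards against models that are plausible given the data but unfavorable to a given policy.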
