Poster

Model-based Reinforcement Learning for Confounded POMDPs

Mao Hong · Zhengling Qi · Yanxun Xu


Abstract:

We propose a model-based offline reinforcement learning (RL) method for confounded partially observable Markov decision processes (POMDPs) under general function approximation, which is provably efficient under a partial coverage assumption on the offline dataset. Specifically, we first establish a novel model-based identification result for learning the effect of any action on the reward and future transitions in confounded POMDPs. Using these identification results, we then design a nonparametric two-stage estimation procedure to construct policy-value estimators that permit general function approximation. Finally, we learn the optimal policy by performing conservative policy optimization within confidence regions constructed from the proposed estimation procedure. Under mild conditions, we establish a finite-sample upper bound on the suboptimality of the learned policy relative to the optimal one, which depends polynomially on the sample size and the length of the horizon.
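To illustrate the conservative (pessimistic) policy-optimization principle mentioned in the abstract, here is a minimal Python sketch in a simplified tabular setting. This is not the paper's estimator: the paper addresses confounded POMDPs with general function approximation and a nonparametric two-stage estimation procedure, whereas the confidence region, candidate policies, and value routine below are all hypothetical stand-ins used only to show the max-min selection step.

```python
# Illustrative sketch only: pessimistic policy selection over a confidence
# region of models in a toy tabular MDP. Every object here (the random
# "confidence region", the enumerated policies) is a placeholder and is not
# the method proposed in the paper.

from itertools import product

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 4, 2, 5


def policy_value(P, R, policy, horizon):
    """Exact finite-horizon value of a deterministic policy under model (P, R)."""
    v = np.zeros(n_states)
    for _ in range(horizon):
        q = R + np.einsum("sat,t->sa", P, v)   # Bellman backup
        v = q[np.arange(n_states), policy]     # follow the fixed policy
    return v[0]                                # value from a fixed initial state


def random_model():
    """A candidate model; stands in for a model consistent with offline data."""
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    return P, R


# A stand-in "confidence region": several candidate models.
confidence_region = [random_model() for _ in range(8)]

# Candidate policies: all deterministic state-to-action maps (small problem).
candidate_policies = [np.array(p) for p in product(range(n_actions), repeat=n_states)]

# Conservative policy optimization: pick the policy maximizing its
# worst-case value over the confidence region.
best_policy, best_pessimistic_value = None, -np.inf
for policy in candidate_policies:
    pessimistic_value = min(
        policy_value(P, R, policy, horizon) for P, R in confidence_region
    )
    if pessimistic_value > best_pessimistic_value:
        best_policy, best_pessimistic_value = policy, pessimistic_value

print("chosen policy:", best_policy, "pessimistic value:", best_pessimistic_value)
```

The design choice illustrated is the pessimism principle common in offline RL: by optimizing the worst-case value over models consistent with the data, the learned policy's guarantee only requires the offline data to cover the optimal policy (partial coverage) rather than all policies.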
