
Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration
Chengzhuo Ni · Ruiqi Zhang · Xiang Ji · Xuezhou Zhang · Mengdi Wang

Thu Jul 21 11:35 AM -- 11:40 AM (PDT)

Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy but only have access to off-policy data generated by some unknown behavior policy. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm. FPG can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. In the case of linear value function approximation, we provide a tight finite-sample policy gradient error bound that is governed by the amount of distribution mismatch measured in feature space. We also establish the asymptotic normality of the FPG estimation error with a precise covariance characterization, which is further shown to be statistically optimal with a matching Cramér-Rao lower bound. Empirically, we evaluate the performance of FPG on both policy gradient estimation and policy optimization. Under various metrics, our results show that FPG significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques.
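To illustrate the general idea of fitted off-policy PG estimation described above, here is a minimal, hypothetical sketch (not the authors' actual FPG algorithm): it fits a linear Q-function to off-policy data via an LSTD-style fixed-point solve, then plugs the fitted Q into the score-function form of the policy gradient. The MDP, features, and behavior policy below are all illustrative assumptions; in particular, the gradient is averaged over the behavior policy's state distribution rather than the target policy's discounted visitation distribution, a simplification for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny illustrative MDP: 3 states, 2 actions (all quantities are assumptions).
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P[s, a] -> dist over s'
R = rng.uniform(0, 1, size=(nS, nA))            # deterministic rewards

def phi(s, a):
    """One-hot (tabular) feature vector for the state-action pair."""
    f = np.zeros(nS * nA)
    f[s * nA + a] = 1.0
    return f

def pi(theta, s):
    """Softmax target policy; theta has one logit per (s, a)."""
    logits = theta[s]
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Collect off-policy data with a uniform behavior policy.
data, s = [], 0
for _ in range(5000):
    a = rng.integers(nA)
    s2 = rng.choice(nS, p=P[s, a])
    data.append((s, a, R[s, a], s2))
    s = s2

def fitted_q(theta):
    """LSTD-style fit of Q^pi: solve A w = b, where
    A = sum phi (phi - gamma * E_{a'~pi} phi')^T and b = sum r * phi."""
    d = nS * nA
    A, b = np.zeros((d, d)), np.zeros(d)
    for (s, a, r, s2) in data:
        f = phi(s, a)
        f2 = sum(pi(theta, s2)[a2] * phi(s2, a2) for a2 in range(nA))
        A += np.outer(f, f - gamma * f2)
        b += r * f
    return np.linalg.solve(A + 1e-6 * np.eye(d), b)  # small ridge for stability

def pg_estimate(theta):
    """Plug the fitted Q into g = E[ grad log pi(a|s) * Q(s, a) ]."""
    w = fitted_q(theta)
    g = np.zeros_like(theta)
    for (s, _, _, _) in data:
        p = pi(theta, s)
        for a in range(nA):
            q = w @ phi(s, a)
            grad_log = -p.copy()          # softmax score: e_a - pi(.|s)
            grad_log[a] += 1.0
            g[s] += p[a] * q * grad_log
    return g / len(data)

theta = np.zeros((nS, nA))
g = pg_estimate(theta)
```

The softmax score function guarantees that each state's gradient row sums to zero, which is a quick sanity check on the estimator.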

Author Information

Chengzhuo Ni (Princeton University)
Ruiqi Zhang (Peking University)
Xiang Ji (Princeton University)
Xuezhou Zhang (Princeton)
Mengdi Wang (Princeton University)
