Timezone: »

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
Yaqi Duan · Zeyu Jia · Mengdi Wang

Thu Jul 16 06:00 AM -- 06:45 AM & Thu Jul 16 05:00 PM -- 05:45 PM (PDT) @ None #None
This paper studies the statistical theory of off-policy evaluation with function approximation in batch data reinforcement learning problem. We consider a regression-based fitted Q-iteration method, show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator, and prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound. The policy evaluation error depends sharply on a restricted $\chi^2$-divergence over the function class between the long-term distribution of target policy and the distribution of past data. This restricted $\chi^2$-divergence characterizes the statistical limit of off-policy evaluation and is both instance-dependent and function-class-dependent. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.

Author Information

Yaqi Duan (Princeton University)
Zeyu Jia (Peking University)
Mengdi Wang (Princeton University)

More from the Same Authors