We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.
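For concreteness, here is a minimal sketch in Python of the estimator the abstract describes, assuming a tabular (finite) state-action space and trajectories given as lists of (state, action, reward) tuples. The function names and the count-based maximum-likelihood fit are illustrative choices for this sketch, not the authors' implementation:
```python
import numpy as np
from collections import defaultdict

def is_estimate(trajectories, pi_e, pi_b):
    """Ordinary importance sampling: weight each trajectory's return by
    the likelihood ratio of its actions under pi_e versus pi_b."""
    estimates = []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for s, a, r in traj:
            ratio *= pi_e(a, s) / pi_b(a, s)
            ret += r  # undiscounted return, for simplicity
        estimates.append(ratio * ret)
    return float(np.mean(estimates))

def fit_behavior_policy(trajectories):
    """Count-based maximum-likelihood estimate of a Markovian behavior
    policy, fit on the same trajectories used for the IS estimate."""
    counts = defaultdict(lambda: defaultdict(float))
    for traj in trajectories:
        for s, a, _ in traj:
            counts[s][a] += 1.0
    def pi_hat(a, s):
        return counts[s][a] / sum(counts[s].values())
    return pi_hat

def estimated_behavior_is(trajectories, pi_e):
    """IS with the estimated behavior policy plugged into the weights
    in place of the true one -- the estimator studied in the paper."""
    return is_estimate(trajectories, pi_e, fit_behavior_policy(trajectories))
```
Intuitively, the plug-in weights use the empirical action frequencies actually observed in the data rather than the true action probabilities, which corrects for sampling error in the action space.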
Author Information
Josiah Hanna (University of Texas at Austin)
Scott Niekum (University of Texas at Austin)
Peter Stone (University of Texas at Austin)
Related Events (a corresponding poster, oral, or spotlight)
- 2019 Poster: Importance Sampling Policy Evaluation with an Estimated Behavior Policy
  Wed. Jun 12th, 01:30 -- 04:00 AM, Room Pacific Ballroom #109
More from the Same Authors
- 2021: Decoupling Exploration and Exploitation in Reinforcement Learning
  Lukas Schäfer · Filippos Christianos · Josiah Hanna · Stefano V. Albrecht
- 2022: Model-Based Meta Automatic Curriculum Learning
  Zifan Xu · Yulin Zhang · Shahaf Shperberg · Reuth Mirsky · Yuqian Jiang · Bo Liu · Peter Stone
- 2022: Task Factorization in Curriculum Learning
  Reuth Mirsky · Shahaf Shperberg · Yulin Zhang · Zifan Xu · Yuqian Jiang · Jiaxun Cui · Peter Stone
- 2022: Q/A: Invited Speaker: Peter Stone
  Peter Stone
- 2022: Invited Speaker: Peter Stone
  Peter Stone
- 2022 Poster: Causal Dynamics Learning for Task-Independent State Abstraction
  Zizhao Wang · Xuesu Xiao · Zifan Xu · Yuke Zhu · Peter Stone
- 2022 Oral: Causal Dynamics Learning for Task-Independent State Abstraction
  Zizhao Wang · Xuesu Xiao · Zifan Xu · Yuke Zhu · Peter Stone
- 2021: Scaling up Probabilistic Safe Learning
  Scott Niekum
- 2021 Poster: Value Alignment Verification
  Daniel Brown · Jordan Schneider · Anca Dragan · Scott Niekum
- 2021 Poster: Coach-Player Multi-agent Reinforcement Learning for Dynamic Team Composition
  Bo Liu · Qiang Liu · Peter Stone · Animesh Garg · Yuke Zhu · Anima Anandkumar
- 2021 Oral: Coach-Player Multi-agent Reinforcement Learning for Dynamic Team Composition
  Bo Liu · Qiang Liu · Peter Stone · Animesh Garg · Yuke Zhu · Anima Anandkumar
- 2021 Spotlight: Value Alignment Verification
  Daniel Brown · Jordan Schneider · Anca Dragan · Scott Niekum
- 2020 Poster: Reducing Sampling Error in Batch Temporal Difference Learning
  Brahma Pavse · Ishan Durugkar · Josiah Hanna · Peter Stone
- 2020 Poster: Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences
  Daniel Brown · Russell Coleman · Ravi Srinivasan · Scott Niekum
- 2019: Peter Stone: Learning Curricula for Transfer Learning in RL
  Peter Stone
- 2019: Panel discussion with Craig Boutilier (Google Research), Emma Brunskill (Stanford), Chelsea Finn (Google Brain, Stanford, UC Berkeley), Mohammad Ghavamzadeh (Facebook AI), John Langford (Microsoft Research), and David Silver (DeepMind)
  Peter Stone · Craig Boutilier · Emma Brunskill · Chelsea Finn · John Langford · David Silver · Mohammad Ghavamzadeh
- 2019: Invited Talk 1: Adaptive Tolling for Multiagent Traffic Optimization
  Peter Stone
- 2019 Poster: Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
  Daniel Brown · Wonjoon Goo · Prabhat Nagarajan · Scott Niekum
- 2019 Oral: Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
  Daniel Brown · Wonjoon Goo · Prabhat Nagarajan · Scott Niekum
- 2017 Poster: Data-Efficient Policy Evaluation Through Behavior Policy Search
  Josiah Hanna · Philip S. Thomas · Peter Stone · Scott Niekum
- 2017 Talk: Data-Efficient Policy Evaluation Through Behavior Policy Search
  Josiah Hanna · Philip S. Thomas · Peter Stone · Scott Niekum