We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for the optimal behavior policy --- the behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present a behavior policy search algorithm and empirically demonstrate its effectiveness in lowering the mean squared error of policy performance estimates.
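The key point of the abstract is that ordinary importance sampling keeps policy-evaluation estimates unbiased for any behavior policy with full support, while the variance (and hence mean squared error) of those estimates depends on which behavior policy generated the data. The following minimal sketch, not the paper's algorithm, illustrates this on a two-action bandit; the reward values and policy probabilities are illustrative assumptions, not numbers from the paper.

```python
# Minimal sketch (illustrative assumptions, not the paper's method):
# ordinary importance-sampling (IS) policy evaluation on a two-action bandit.
# The IS estimate is unbiased for any behavior policy with full support,
# but its mean squared error depends on the choice of behavior policy.
import numpy as np

rng = np.random.default_rng(0)

rewards = np.array([1.0, 0.2])       # deterministic reward for actions 0 and 1
pi_e = np.array([0.9, 0.1])          # evaluation policy to be evaluated
true_value = float(pi_e @ rewards)   # J(pi_e) = 0.92

def is_estimate(pi_b, n=1000):
    """Ordinary IS estimate of J(pi_e) from n actions sampled from pi_b."""
    actions = rng.choice(2, size=n, p=pi_b)
    weights = pi_e[actions] / pi_b[actions]   # importance weights pi_e(a) / pi_b(a)
    return float(np.mean(weights * rewards[actions]))

# Compare the on-policy Monte Carlo estimator (behavior policy = evaluation
# policy) against two alternative behavior policies: all are unbiased, but
# their empirical MSEs differ.
for pi_b in (pi_e, np.array([0.5, 0.5]), np.array([0.99, 0.01])):
    sq_errors = [(is_estimate(pi_b) - true_value) ** 2 for _ in range(200)]
    print(pi_b, "MSE ~", np.mean(sq_errors))
```

The paper's contribution, as stated above, is to characterize and then search for the behavior policy that minimizes this mean squared error rather than fixing it in advance.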
Author Information
Josiah Hanna (University of Texas at Austin)
Philip S. Thomas (CMU)
Peter Stone (University of Texas at Austin)
Scott Niekum (University of Texas at Austin)
Related Events (a corresponding poster, oral, or spotlight)
- 2017 Poster: Data-Efficient Policy Evaluation Through Behavior Policy Search
  Tue, Aug 8, 08:30 AM -- 12:00 PM, Gallery #33
More from the Same Authors
- 2022: Model-Based Meta Automatic Curriculum Learning
  Zifan Xu · Yulin Zhang · Shahaf Shperberg · Reuth Mirsky · Yuqian Jiang · Bo Liu · Peter Stone
- 2022: Task Factorization in Curriculum Learning
  Reuth Mirsky · Shahaf Shperberg · Yulin Zhang · Zifan Xu · Yuqian Jiang · Jiaxun Cui · Peter Stone
- 2022: Q/A: Invited Speaker: Peter Stone
  Peter Stone
- 2022: Invited Speaker: Peter Stone
  Peter Stone
- 2022 Poster: Causal Dynamics Learning for Task-Independent State Abstraction
  Zizhao Wang · Xuesu Xiao · Zifan Xu · Yuke Zhu · Peter Stone
- 2022 Oral: Causal Dynamics Learning for Task-Independent State Abstraction
  Zizhao Wang · Xuesu Xiao · Zifan Xu · Yuke Zhu · Peter Stone
- 2021: Scaling up Probabilistic Safe Learning
  Scott Niekum
- 2021 Poster: Value Alignment Verification
  Daniel Brown · Jordan Schneider · Anca Dragan · Scott Niekum
- 2021 Poster: Coach-Player Multi-agent Reinforcement Learning for Dynamic Team Composition
  Bo Liu · Qiang Liu · Peter Stone · Animesh Garg · Yuke Zhu · Anima Anandkumar
- 2021 Oral: Coach-Player Multi-agent Reinforcement Learning for Dynamic Team Composition
  Bo Liu · Qiang Liu · Peter Stone · Animesh Garg · Yuke Zhu · Anima Anandkumar
- 2021 Spotlight: Value Alignment Verification
  Daniel Brown · Jordan Schneider · Anca Dragan · Scott Niekum
- 2020 Poster: Reducing Sampling Error in Batch Temporal Difference Learning
  Brahma Pavse · Ishan Durugkar · Josiah Hanna · Peter Stone
- 2020 Poster: Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences
  Daniel Brown · Russell Coleman · Ravi Srinivasan · Scott Niekum
- 2019: Peter Stone: Learning Curricula for Transfer Learning in RL
  Peter Stone
- 2019: Panel discussion with Craig Boutilier (Google Research), Emma Brunskill (Stanford), Chelsea Finn (Google Brain, Stanford, UC Berkeley), Mohammad Ghavamzadeh (Facebook AI), John Langford (Microsoft Research), and David Silver (DeepMind)
  Peter Stone · Craig Boutilier · Emma Brunskill · Chelsea Finn · John Langford · David Silver · Mohammad Ghavamzadeh
- 2019: Invited Talk 1: Adaptive Tolling for Multiagent Traffic Optimization
  Peter Stone
- 2019 Poster: Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
  Daniel Brown · Wonjoon Goo · Prabhat Nagarajan · Scott Niekum
- 2019 Oral: Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
  Daniel Brown · Wonjoon Goo · Prabhat Nagarajan · Scott Niekum
- 2019 Poster: Importance Sampling Policy Evaluation with an Estimated Behavior Policy
  Josiah Hanna · Scott Niekum · Peter Stone
- 2019 Oral: Importance Sampling Policy Evaluation with an Estimated Behavior Policy
  Josiah Hanna · Scott Niekum · Peter Stone