Hindsight allows reinforcement learning agents to leverage new observations to make inferences about earlier states and transitions. In this paper, we exploit the idea of hindsight and introduce posterior value functions. Posterior value functions are computed by inferring the posterior distribution over hidden components of the state in previous timesteps and can be used to construct novel unbiased baselines for policy gradient methods. Importantly, we prove that these baselines reduce (and never increase) the variance of policy gradient estimators compared to traditional state value functions. While the posterior value function is motivated by partial observability, we extend these results to arbitrary stochastic MDPs by showing that hindsight-capable agents can model stochasticity in the environment as a special case of partial observability. Finally, we introduce a pair of methods for learning posterior value functions and prove their convergence.
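As a rough aid to reading the abstract, the sketch below shows, in illustrative notation (O_t for the agent's observation, S_t for the underlying state, A_t for the action, G_t for the return, h_t for hindsight information gathered after time t), where a posterior value function would enter a policy-gradient estimator. This is a generic LaTeX sketch under those assumptions, not the paper's exact definitions or proofs.

% Policy-gradient estimator with a baseline b_t (assumes amsmath/amssymb):
\[
\nabla_\theta J(\theta)
  \;=\; \mathbb{E}\!\left[\sum_{t=0}^{T-1}
        \nabla_\theta \log \pi_\theta(A_t \mid O_t)\,\bigl(G_t - b_t\bigr)\right].
\]
% The estimator stays unbiased whenever
% \mathbb{E}[\nabla_\theta \log \pi_\theta(A_t \mid O_t)\, b_t] = 0,
% as with the conventional observation-based value baseline
\[
b_t \;=\; \hat{V}^{\pi}(O_t) \;\approx\; \mathbb{E}\bigl[G_t \mid O_t\bigr].
\]
% Hindsight idea (illustrative form only): condition the baseline on the
% posterior over the hidden part of the state at time t, inferred from
% information h_t observed later in the trajectory,
\[
b_t^{\mathrm{post}} \;=\; \mathbb{E}\bigl[V^{\pi}(S_t) \mid O_t,\, h_t\bigr].
\]
% Caveat: conditioning a baseline on future information can, in general,
% bias the gradient; the paper's contribution is a construction for which
% unbiasedness holds and the variance is provably no larger than with the
% standard state value baseline.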
Author Information
Chris Nota (University of Massachusetts Amherst)
Philip Thomas (University of Massachusetts Amherst)
Bruno C. da Silva (University of Massachusetts)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Spotlight: Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods »
  Thu. Jul 22nd 01:40 -- 01:45 PM
More from the Same Authors
- 2022 Poster: Constrained Offline Policy Optimization »
  Nicholas Polosky · Bruno C. da Silva · Madalina Fiterau · Jithin Jagannath
- 2022 Spotlight: Constrained Offline Policy Optimization »
  Nicholas Polosky · Bruno C. da Silva · Madalina Fiterau · Jithin Jagannath
- 2022 Poster: Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer »
  Lucas N. Alegre · Ana Lucia Cetertich Bazzan · Bruno C. da Silva
- 2022 Spotlight: Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer »
  Lucas N. Alegre · Ana Lucia Cetertich Bazzan · Bruno C. da Silva
- 2021 Spotlight: Towards Practical Mean Bounds for Small Samples »
  My Phan · Philip Thomas · Erik Learned-Miller
- 2021 Poster: Towards Practical Mean Bounds for Small Samples »
  My Phan · Philip Thomas · Erik Learned-Miller
- 2021 Poster: High Confidence Generalization for Reinforcement Learning »
  James Kostas · Yash Chandak · Scott Jordan · Georgios Theocharous · Philip Thomas
- 2021 Spotlight: High Confidence Generalization for Reinforcement Learning »
  James Kostas · Yash Chandak · Scott Jordan · Georgios Theocharous · Philip Thomas
- 2021: Minimum-Delay Adaptation in Non-Stationary Reinforcement Learning via Online High-Confidence Change-Point Detection »
  Lucas N. Alegre · Ana Lucia Cetertich Bazzan · Bruno C. da Silva
- 2020 Poster: Asynchronous Coagent Networks »
  James Kostas · Chris Nota · Philip Thomas
- 2020 Poster: Evaluating the Performance of Reinforcement Learning Algorithms »
  Scott Jordan · Yash Chandak · Daniel Cohen · Mengxue Zhang · Philip Thomas
- 2020 Poster: Optimizing for the Future in Non-Stationary MDPs »
  Yash Chandak · Georgios Theocharous · Shiv Shankar · Martha White · Sridhar Mahadevan · Philip Thomas
- 2019 Poster: Concentration Inequalities for Conditional Value at Risk »
  Philip Thomas · Erik Learned-Miller
- 2019 Oral: Concentration Inequalities for Conditional Value at Risk »
  Philip Thomas · Erik Learned-Miller
- 2019 Poster: Learning Action Representations for Reinforcement Learning »
  Yash Chandak · Georgios Theocharous · James Kostas · Scott Jordan · Philip Thomas
- 2019 Oral: Learning Action Representations for Reinforcement Learning »
  Yash Chandak · Georgios Theocharous · James Kostas · Scott Jordan · Philip Thomas
- 2018 Poster: Decoupling Gradient-Like Learning Rules from Representations »
  Philip Thomas · Christoph Dann · Emma Brunskill
- 2018 Oral: Decoupling Gradient-Like Learning Rules from Representations »
  Philip Thomas · Christoph Dann · Emma Brunskill