
Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods
Chris Nota · Philip Thomas · Bruno C. da Silva

Thu Jul 22 09:00 AM -- 11:00 AM (PDT)

Hindsight allows reinforcement learning agents to leverage new observations to make inferences about earlier states and transitions. In this paper, we exploit the idea of hindsight and introduce posterior value functions. Posterior value functions are computed by inferring the posterior distribution over hidden components of the state in previous timesteps and can be used to construct novel unbiased baselines for policy gradient methods. Importantly, we prove that these baselines reduce (and never increase) the variance of policy gradient estimators compared to traditional state value functions. While the posterior value function is motivated by partial observability, we extend these results to arbitrary stochastic MDPs by showing that hindsight-capable agents can model stochasticity in the environment as a special case of partial observability. Finally, we introduce a pair of methods for learning posterior value functions and prove their convergence.
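The abstract's key claim — that a baseline which uses hindsight knowledge of a hidden state component can reduce the variance of a policy gradient estimator without introducing bias — can be illustrated on a toy one-step problem. The sketch below is an illustrative construction, not the paper's method: a two-armed bandit whose return includes hidden noise `z` revealed only after acting. Because `z` is independent of the action, subtracting it in the baseline leaves the estimator unbiased (any action-independent baseline does), but removes a large variance term that an ordinary state-value baseline cannot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step problem: two actions, softmax policy with scalar parameter theta.
theta = 0.3
rewards = np.array([1.0, 0.0])  # deterministic reward r(a)
sigma = 5.0                     # scale of the hidden noise z (revealed in hindsight)

def softmax_probs(theta):
    logits = np.array([theta, 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_pi(a, probs):
    # d/dtheta log pi(a): the logit of action 0 is theta, action 1's is fixed at 0.
    return (1.0 if a == 0 else 0.0) - probs[0]

def estimate(n, baseline):
    """Draw n score-function gradient samples with the given baseline."""
    probs = softmax_probs(theta)
    grads = np.empty(n)
    for i in range(n):
        a = rng.choice(2, p=probs)
        z = rng.normal(0.0, sigma)     # hidden state component
        R = rewards[a] + z             # observed return
        if baseline == "state":        # ordinary state value: V = E[R]
            b = probs @ rewards
        else:                          # hindsight baseline: E[R | z] = E[r(a)] + z
            b = probs @ rewards + z
        grads[i] = grad_log_pi(a, probs) * (R - b)
    return grads

g_state = estimate(20000, "state")
g_post = estimate(20000, "posterior")
# Both estimators share the same mean (the true gradient, about 0.24 here),
# but the hindsight baseline removes the sigma^2-scale variance contributed by z.
print("state baseline:    mean %.3f  var %.3f" % (g_state.mean(), g_state.var()))
print("hindsight baseline: mean %.3f  var %.3f" % (g_post.mean(), g_post.var()))
```

Here the hindsight baseline is a stand-in for the paper's posterior value function: it conditions on information about the hidden component inferred after the fact, while remaining independent of the action taken.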

Author Information

Chris Nota (University of Massachusetts Amherst)
Philip Thomas (University of Massachusetts Amherst)
Bruno C. da Silva (University of Massachusetts)
