Timezone: »
We show that on-policy policy gradient (PG) and its variance reduction variants can be derived by taking finite-difference of function evaluations supplied by estimators from the importance sampling (IS) family for off-policy evaluation (OPE). Starting from the doubly robust (DR) estimator (Jiang & Li, 2016), we provide a simple derivation of a very general and flexible form of PG, which subsumes the state-of-the-art variance reduction technique (Cheng et al., 2019) as its special case and immediately hints at further variance reduction opportunities overlooked by existing literature. We analyze the variance of the new DR-PG estimator, compare it to existing methods as well as the Cramer-Rao lower bound of policy gradient, and empirically show its effectiveness.
Author Information
Jiawei Huang (University of Illinois at Urbana-Champaign)
Nan Jiang (University of Illinois at Urbana-Champaign)
More from the Same Authors
-
2021 : A Spectral Approach to Off-Policy Evaluation for POMDPs »
Yash Nair · Nan Jiang -
2021 : Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning »
Tengyang Xie · Nan Jiang · Huan Wang · Caiming Xiong · Yu Bai -
2022 : Interaction-Grounded Learning with Action-inclusive Feedback »
Tengyang Xie · Akanksha Saran · Dylan Foster · Lekan Molu · Ida Momennejad · Nan Jiang · Paul Mineiro · John Langford -
2022 : Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions »
Audrey Huang · Nan Jiang -
2023 Poster: Offline Learning in Markov Games with General Function Approximation »
Yuheng Zhang · Yu Bai · Nan Jiang -
2023 Poster: The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation »
Philip Amortila · Nan Jiang · Csaba Szepesvari -
2023 Poster: Reinforcement Learning in Low-rank MDPs with Density Features »
Audrey Huang · Jinglin Chen · Nan Jiang -
2022 Poster: Adversarially Trained Actor Critic for Offline Reinforcement Learning »
Ching-An Cheng · Tengyang Xie · Nan Jiang · Alekh Agarwal -
2022 Oral: Adversarially Trained Actor Critic for Offline Reinforcement Learning »
Ching-An Cheng · Tengyang Xie · Nan Jiang · Alekh Agarwal -
2022 Poster: A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes »
Chengchun Shi · Masatoshi Uehara · Jiawei Huang · Nan Jiang -
2022 Oral: A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes »
Chengchun Shi · Masatoshi Uehara · Jiawei Huang · Nan Jiang -
2021 Poster: Batch Value-function Approximation with Only Realizability »
Tengyang Xie · Nan Jiang -
2021 Spotlight: Batch Value-function Approximation with Only Realizability »
Tengyang Xie · Nan Jiang -
2020 Poster: Minimax Weight and Q-Function Learning for Off-Policy Evaluation »
Masatoshi Uehara · Jiawei Huang · Nan Jiang -
2019 Poster: Provably efficient RL with Rich Observations via Latent State Decoding »
Simon Du · Akshay Krishnamurthy · Nan Jiang · Alekh Agarwal · Miroslav Dudik · John Langford -
2019 Poster: Information-Theoretic Considerations in Batch Reinforcement Learning »
Jinglin Chen · Nan Jiang -
2019 Oral: Information-Theoretic Considerations in Batch Reinforcement Learning »
Jinglin Chen · Nan Jiang -
2019 Oral: Provably efficient RL with Rich Observations via Latent State Decoding »
Simon Du · Akshay Krishnamurthy · Nan Jiang · Alekh Agarwal · Miroslav Dudik · John Langford