Spotlight

Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach

Shuang Wu ⋅ Ling Shi ⋅ Jun Wang ⋅ Guangjian Tian

Keywords: RL: Total Cost/Reward; RL: Policy Search; RL: Average Cost/Reward; RL: Discounted Cost/Reward

2022 Spotlight

Abstract

REINFORCE (Williams, 1992) is a popular policy gradient (PG) algorithm for solving reinforcement learning (RL) problems, while the theoretical form of PG comes from Sutton et al. (1999). Although both formulae prescribe PG, the precise connection between them has not yet been made clear. Recently, Nota and Thomas (2020) found that this ambiguity causes implementation errors. Motivated by the ambiguity and the resulting incorrect implementations, we study PG from a perturbation perspective. In particular, we derive PG in a unified framework, precisely clarify the relation between PG implementation and theory, and echo the findings of Nota and Thomas (2020). Examining the factors that contribute to the empirical success of the existing erroneous implementations, we find that a small approximation error and the experience replay mechanism play critical roles.
