Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach

Shuang Wu · Ling Shi · Jun Wang · Guangjian Tian

Hall E #929

Keywords: [ RL: Average Cost/Reward ] [ RL: Discounted Cost/Reward ] [ RL: Total Cost/Reward ] [ RL: Policy Search ]


The REINFORCE algorithm \cite{williams1992simple} is popular in policy gradient (PG) methods for solving reinforcement learning (RL) problems, while the theoretical form of PG comes from~\cite{sutton1999policy}. Although both formulae prescribe PG, their precise connection has not been made clear. Recently, \citeauthor{nota2020policy} (\citeyear{nota2020policy}) found that this ambiguity leads to implementation errors. Motivated by the ambiguity and the incorrect implementations, we study PG from a perturbation perspective. In particular, we derive PG in a unified framework, precisely clarify the relation between the PG implementation and theory, and echo the findings of \citeauthor{nota2020policy}. Examining the factors behind the empirical success of the existing erroneous implementations, we find that small approximation error and the experience replay mechanism play critical roles.
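For readers unfamiliar with the algorithm under discussion, here is a minimal sketch of a REINFORCE update on a toy two-armed bandit (a hypothetical setup for illustration only, not the paper's experiments or its proposed method; the arm means, learning rate, and step count are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # policy parameters (logits, one per arm)
true_means = np.array([0.0, 1.0])   # arm 1 pays more on average (assumed toy values)
alpha = 0.1                         # learning rate

def softmax(x):
    z = np.exp(x - x.max())         # subtract max for numerical stability
    return z / z.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)             # sample an action from the policy
    r = true_means[a] + rng.normal()    # observe a noisy reward
    # grad of log pi(a) w.r.t. theta for a softmax policy: one_hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi    # REINFORCE update: r * grad log pi(a)

print(softmax(theta))  # probability mass should concentrate on arm 1
```

Note that in the full episodic setting the update also involves how returns and discount factors are accumulated over time steps; the discrepancy between how implementations handle those factors and what the PG theorem prescribes is exactly the ambiguity studied in this paper.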