Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning
Tadashi Kozuno · Yunhao Tang · Mark Rowland · Remi Munos · Steven Kapturowski · Will Dabney · Michal Valko · David Abel

Tue Jul 20 09:00 PM -- 11:00 PM (PDT) @ Virtual #None
Off-policy multi-step reinforcement learning algorithms fall into two classes, conservative and non-conservative: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have limited or no theoretical guarantees. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng's Q($\lambda$), a representative example of non-conservative algorithms. We prove that it also converges to an optimal policy, provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng's Q($\lambda$) in complex continuous control tasks, confirming that Peng's Q($\lambda$) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng's Q($\lambda$), which was thought to be unsafe, is a theoretically sound and practically effective algorithm.
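
To make the "non-conservative" idea in the abstract concrete, here is a minimal sketch of how Peng's Q($\lambda$) targets are commonly computed along a single trajectory. It uses the usual recursive form G_t = r_t + gamma * [(1 - lambda) * max_a Q(s_{t+1}, a) + lambda * G_{t+1}], which mixes n-step returns bootstrapped with a greedy value and never cuts traces when the behavior action is non-greedy. The function name, argument layout, and default hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def peng_q_lambda_targets(rewards, next_q_values, dones, gamma=0.99, lam=0.7):
    """Compute Peng's Q(lambda) targets for one trajectory (hypothetical helper).

    rewards:       shape (T,),   reward r_t received at step t
    next_q_values: shape (T, A), current estimates Q(s_{t+1}, a) for every action
    dones:         shape (T,),   True if s_{t+1} is terminal
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q_values, dtype=float)
    dones = np.asarray(dones, dtype=bool)

    # Greedy bootstrap value max_a Q(s_{t+1}, a); zero after a terminal transition.
    next_v = np.where(dones, 0.0, next_q.max(axis=1))

    targets = np.zeros(len(rewards))
    # Beyond the end of the stored trajectory there is no further return,
    # so the last step falls back to a one-step greedy bootstrap.
    next_return = next_v[-1]
    for t in reversed(range(len(rewards))):
        if dones[t]:
            mixed = 0.0  # nothing to bootstrap from after termination
        else:
            # No trace cutting: the lambda-weighted return is always propagated,
            # whether or not the behavior action was greedy.
            mixed = (1.0 - lam) * next_v[t] + lam * next_return
        targets[t] = rewards[t] + gamma * mixed
        next_return = targets[t]
    return targets
```

With `lam=0` this reduces to the one-step Q-learning target, and with `lam=1` it becomes the uncorrected Monte Carlo return of the behavior policy. A conservative method such as Retrace would instead scale the propagated term by a truncated importance weight.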

Author Information

Tadashi Kozuno (University of Alberta)
Yunhao Tang (Columbia University)
Mark Rowland (DeepMind)
Remi Munos (DeepMind)
Steven Kapturowski (DeepMind)
Will Dabney (DeepMind)
Michal Valko (DeepMind / Inria / ENS Paris-Saclay)
David Abel (DeepMind)

More from the Same Authors