
DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm
Yunhao Tang · Tadashi Kozuno · Mark Rowland · Anna Harutyunyan · Remi Munos · Bernardo Avila Pires · Michal Valko

Tue Jul 25 05:00 PM -- 06:30 PM (PDT) @ Exhibit Hall 1 #521

Multi-step learning applies lookahead over multiple time steps and has proved valuable in policy evaluation settings. However, in the optimal control case, the impact of multi-step learning has been relatively limited despite a number of prior efforts. Fundamentally, this might be because multi-step policy improvements require operations that cannot be approximated by stochastic samples, hindering the widespread adoption of such methods in practice. To address these limitations, we introduce doubly multi-step off-policy VI (DoMo-VI), a novel oracle algorithm that combines multi-step policy improvements and policy evaluations. DoMo-VI enjoys a guaranteed convergence speed-up to the optimal policy and is applicable in general off-policy learning settings. We then propose doubly multi-step off-policy actor-critic (DoMo-AC), a practical instantiation of the DoMo-VI algorithm. DoMo-AC introduces a bias-variance trade-off that ensures improved policy gradient estimates. When combined with the IMPALA architecture, DoMo-AC showed improvements over the baseline algorithm on the Atari-57 benchmark.
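The abstract does not spell out the DoMo-VI operators, but the convergence speed-up from multi-step lookahead can be illustrated with a textbook example: applying the Bellman optimality operator h times per iteration contracts the error by a factor of gamma**h instead of gamma. The sketch below uses a tiny hypothetical MDP (all transition and reward numbers are made up for illustration) and is not the paper's algorithm.

```python
import numpy as np

# Tiny 2-state, 2-action MDP; numbers are illustrative, not from the paper.
# P[a, s, s'] = transition probability, R[a, s] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.1, 0.9]],  # action 1
])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def bellman_opt(V):
    # One application of the Bellman optimality operator T:
    # (TV)(s) = max_a [ R(a, s) + gamma * sum_s' P(a, s, s') V(s') ]
    return np.max(R + gamma * P @ V, axis=0)

def value_iteration(h, iters):
    # Each outer iteration applies T h times (an h-step lookahead),
    # so the error contracts by gamma**h per outer iteration.
    V = np.zeros(2)
    for _ in range(iters):
        for _ in range(h):
            V = bellman_opt(V)
    return V

V1 = value_iteration(h=1, iters=200)  # one-step backups
V3 = value_iteration(h=3, iters=200)  # three-step lookahead
# Both iterations converge to the same optimal value function V*;
# the h=3 variant needs roughly a third as many outer iterations.
```

The speed-up is only free when the operator can be applied exactly; the paper's contribution is making such multi-step improvement usable with stochastic, off-policy samples, where a naive sample-based approximation of the multi-step operator is not available.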

Author Information

Yunhao Tang (Google DeepMind)
Tadashi Kozuno (Omron Sinic X)
Mark Rowland (Google DeepMind)
Anna Harutyunyan (DeepMind)
Remi Munos (DeepMind)
Bernardo Avila Pires (Google DeepMind)
Michal Valko (Google DeepMind / Inria / MVA)
