Off-Policy Actor-Critic with Shared Experience Replay
Simon Schmitt · Matteo Hessel · Karen Simonyan

Thu Jul 16 03:00 PM -- 03:45 PM & Fri Jul 17 04:00 AM -- 04:45 AM (PDT)

We investigate the combination of actor-critic reinforcement learning algorithms with uniform large-scale experience replay and propose solutions for two ensuing challenges: (a) efficient actor-critic learning with experience replay, and (b) the stability of off-policy learning where agents learn from other agents' behaviour. To this end, we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solutions on DMLab-30 and further show the benefits of this setup in two training regimes for Atari: (1) a single agent is trained for up to 200M environment frames per game, and (2) a population of agents is trained for up to 200M environment frames each and may share experience. We demonstrate state-of-the-art data efficiency among model-free agents in both regimes.
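For readers unfamiliar with V-trace, the following is a minimal sketch of how its value targets can be computed for a single trajectory, using the standard published definition with clipped importance weights. It is an illustration only, not the authors' implementation: the function name `vtrace_targets`, its argument names, and the default clipping thresholds `rho_bar` and `c_bar` are assumptions made for this example.

```python
import math


def vtrace_targets(log_rhos, rewards, values, bootstrap_value,
                   discount=0.99, rho_bar=1.0, c_bar=1.0):
    """Sketch of V-trace value targets for one trajectory of length T.

    All names here are hypothetical, chosen for illustration.
    log_rhos[t]     : log pi(a_t | x_t) - log mu(a_t | x_t), the log ratio of
                      target policy to behaviour policy probabilities
    rewards[t]      : reward received after taking a_t
    values[t]       : current value estimate V(x_t)
    bootstrap_value : V(x_T), the value estimate after the last step
    """
    T = len(rewards)
    ratios = [math.exp(lr) for lr in log_rhos]
    # Clipped importance weights: rho_t scales the TD error,
    # c_t controls how far corrections propagate backwards in time.
    rhos = [min(rho_bar, r) for r in ratios]
    cs = [min(c_bar, r) for r in ratios]

    next_values = list(values[1:]) + [bootstrap_value]
    # Importance-weighted temporal-difference errors.
    deltas = [rhos[t] * (rewards[t] + discount * next_values[t] - values[t])
              for t in range(T)]

    # Backward recursion:
    # v_t - V(x_t) = delta_t + discount * c_t * (v_{t+1} - V(x_{t+1}))
    targets = [0.0] * T
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + discount * cs[t] * acc
        targets[t] = values[t] + acc
    return targets
```

When the behaviour and target policies coincide (all log ratios zero), the targets reduce to ordinary on-policy n-step returns. When the ratios are very small, the clipped weights shrink the corrections towards zero and the targets stay close to the current value estimates, which illustrates the bias-variance tradeoff the abstract refers to.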

Author Information

Simon Schmitt (DeepMind)
Matteo Hessel (DeepMind)
Karen Simonyan (DeepMind)

More from the Same Authors