Session: Reinforcement Learning 14
RLlib: Abstractions for Distributed Reinforcement Learning
Eric Liang · Richard Liaw · Robert Nishihara · Philipp Moritz · Roy Fox · Ken Goldberg · Joseph E Gonzalez · Michael Jordan · Ion Stoica
Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse. RLlib is available as part of the open source Ray project at http://rllib.io/.
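To make the idea of top-down hierarchical control concrete, here is a minimal Python sketch using Ray's core task API (ray.init, @ray.remote, ray.get), which RLlib is built on: the driver loop owns control flow while rollouts run as short-lived parallel tasks. The rollout and update logic is a toy stand-in for illustration, not RLlib's actual API.

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    def rollout(policy_weights, seed):
        # Short-running task: pretend to collect one rollout under the
        # given weights and return a (toy) sample batch.
        rng = np.random.default_rng(seed)
        return {"obs": rng.normal(size=(32, 4)), "rew": rng.normal(size=32)}

    def train(num_iters=10, num_workers=4):
        weights = np.zeros(4)
        for it in range(num_iters):
            # Top-down control: the driver launches parallel rollout tasks,
            # waits for their results, then applies a (dummy) central update.
            batches = ray.get([rollout.remote(weights, it * num_workers + i)
                               for i in range(num_workers)])
            grad = np.mean([b["obs"].mean(axis=0) * b["rew"].mean()
                            for b in batches], axis=0)
            weights += 0.01 * grad
        return weights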
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Lasse Espeholt · Hubert Soyer · Remi Munos · Karen Simonyan · Vlad Mnih · Tom Ward · Yotam Doron · Vlad Firoiu · Tim Harley · Iain Dunning · Shane Legg · Koray Kavukcuoglu
In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in the Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.
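The key algorithmic piece here is the V-trace off-policy correction. Below is a minimal NumPy sketch of the V-trace value targets following the paper's backward recursion; the argument names and the truncation thresholds rho_bar and c_bar are chosen for illustration.

    import numpy as np

    def vtrace_targets(rewards, values, bootstrap_value, rhos,
                       gamma=0.99, rho_bar=1.0, c_bar=1.0):
        """V-trace targets for a trajectory of length T.
        rewards, rhos: shape [T]; values: shape [T] (V(x_0..x_{T-1}));
        bootstrap_value: scalar V(x_T); rhos: importance ratios pi/mu."""
        T = len(rewards)
        clipped_rhos = np.minimum(rho_bar, rhos)
        cs = np.minimum(c_bar, rhos)
        values_tp1 = np.append(values[1:], bootstrap_value)
        deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)
        vs = np.zeros(T)
        acc = 0.0
        # Backward recursion:
        # v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
        for t in reversed(range(T)):
            acc = deltas[t] + gamma * cs[t] * acc
            vs[t] = values[t] + acc
        return vs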
Mix & Match - Agent Curricula for Reinforcement Learning
Wojciech Czarnecki · Siddhant Jayakumar · Max Jaderberg · Leonard Hasenclever · Yee Teh · Nicolas Heess · Simon Osindero · Razvan Pascanu
We introduce Mix & Match (M&M), a training framework designed to facilitate rapid and effective learning in RL agents that would be too slow or too challenging to train otherwise. The key innovation is a procedure that automatically forms a curriculum over agents. Through such a curriculum we can progressively train more complex agents by, effectively, bootstrapping from solutions found by simpler agents. In contrast to typical curriculum learning approaches, we do not gradually modify the tasks or environments presented, but instead use a process that gradually alters how the policy is represented internally. We show the broad applicability of our method by demonstrating significant performance gains in three different experimental setups: (1) We train an agent able to control more than 700 actions in a challenging 3D first-person task; using our method to progress through an action-space curriculum, we achieve both faster training and better final performance than one obtains using traditional methods. (2) We further show that M&M can be used successfully to progress through a curriculum of architectural variants defining an agent's internal state. (3) Finally, we illustrate how a variant of our method can be used to improve agent performance in a multitask setting.
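As a rough illustration of bootstrapping a complex agent from a simpler one, the snippet below blends the two agents' action distributions with a mixing weight annealed from 0 to 1 over training. The mixture form and the linear schedule are assumptions made for this sketch, not necessarily the paper's exact procedure.

    import numpy as np

    def mixture_policy(probs_simple, probs_complex, alpha):
        # With weight (1 - alpha) the agent acts like the simple policy,
        # with weight alpha like the complex one; alpha is annealed 0 -> 1
        # so the complex agent gradually takes over.
        return (1.0 - alpha) * probs_simple + alpha * probs_complex

    def linear_alpha(step, anneal_steps):
        # Hypothetical linear annealing schedule, for illustration only.
        return min(1.0, step / float(anneal_steps))

    # Example: sample an action from the blended policy at training step 5000.
    p = mixture_policy(np.array([0.7, 0.2, 0.1]),
                       np.array([0.1, 0.1, 0.8]),
                       linear_alpha(5000, 20000))
    action = np.random.choice(len(p), p=p)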
Learning to Explore via Meta-Policy Gradient
Tianbing Xu · Qiang Liu · Liang Zhao · Jian Peng
The performance of off-policy learning, including deep Q-learning and deep deterministic policy gradient (DDPG), critically depends on the choice of the exploration policy. Existing exploration methods are mostly based on adding noise to the ongoing actor policy and can only explore local regions close to what the actor policy dictates. In this work, we develop a simple meta-policy gradient algorithm that allows us to adaptively learn the exploration policy in DDPG. Our algorithm allows us to train flexible exploration behaviors that are independent of the actor policy, yielding a global exploration that significantly speeds up the learning process. With an extensive study, we show that our method significantly improves the sample-efficiency of DDPG on a variety of reinforcement learning continuous control tasks.
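The core quantity described here is a meta-reward for the exploration policy: how much the DDPG actor improves after training on the data the exploration policy collected. The NumPy sketch below computes a REINFORCE-style meta-gradient for a linear-Gaussian exploration policy; the policy parameterization and names are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def meta_policy_gradient(states, actions, mu_weights, sigma, meta_reward):
        """REINFORCE-style meta-gradient for a linear-Gaussian exploration
        policy pi_e(a|s) = N(W s, sigma^2 I) (an illustrative assumption).
        meta_reward is the scalar improvement of the DDPG actor measured
        before vs. after training on the exploration data, so exploration
        that helps the learner gets reinforced.
        states: [T, ds], actions: [T, da], mu_weights W: [da, ds]."""
        means = states @ mu_weights.T                          # [T, da]
        # grad of sum_t log N(a_t | W s_t, sigma^2 I) w.r.t. W
        grad_logp = ((actions - means) / sigma**2).T @ states  # [da, ds]
        return meta_reward * grad_logp

    # Usage sketch:
    # W += lr * meta_policy_gradient(S, A, W, sigma, perf_after - perf_before)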