Session: Reinforcement Learning 14
RLlib: Abstractions for Distributed Reinforcement Learning
Eric Liang · Richard Liaw · Robert Nishihara · Philipp Moritz · Roy Fox · Ken Goldberg · Joseph E Gonzalez · Michael Jordan · Ion Stoica
Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse. RLlib is available as part of the open source Ray project at http://rllib.io/.
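To make the idea of top-down hierarchical control concrete, here is a minimal Python sketch using Ray's core task API (ray.init, @ray.remote, ray.get), which RLlib is built on: the driver loop owns control flow while rollouts run as short-lived parallel tasks. The rollout and update logic is a toy stand-in for illustration, not RLlib's actual API.

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    def rollout(policy_weights, seed):
        # Short-running task: pretend to collect one rollout under the
        # given weights and return a (toy) sample batch.
        rng = np.random.default_rng(seed)
        return {"obs": rng.normal(size=(32, 4)), "rew": rng.normal(size=32)}

    def train(num_iters=10, num_workers=4):
        weights = np.zeros(4)
        for it in range(num_iters):
            # Top-down control: the driver launches parallel rollout tasks,
            # waits for their results, then applies a (dummy) central update.
            batches = ray.get([rollout.remote(weights, it * num_workers + i)
                               for i in range(num_workers)])
            grad = np.mean([b["obs"].mean(axis=0) * b["rew"].mean()
                            for b in batches], axis=0)
            weights += 0.01 * grad
        return weights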
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Lasse Espeholt · Hubert Soyer · Remi Munos · Karen Simonyan · Vlad Mnih · Tom Ward · Yotam Doron · Vlad Firoiu · Tim Harley · Iain Dunning · Shane Legg · Koray Kavukcuoglu
In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in the Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.
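The key algorithmic piece here is the V-trace off-policy correction. Below is a minimal NumPy sketch of the V-trace value targets following the paper's backward recursion; the argument names and the truncation thresholds rho_bar and c_bar are chosen for illustration.

    import numpy as np

    def vtrace_targets(rewards, values, bootstrap_value, rhos,
                       gamma=0.99, rho_bar=1.0, c_bar=1.0):
        """V-trace targets for a trajectory of length T.
        rewards, rhos: shape [T]; values: shape [T] (V(x_0..x_{T-1}));
        bootstrap_value: scalar V(x_T); rhos: importance ratios pi/mu."""
        T = len(rewards)
        clipped_rhos = np.minimum(rho_bar, rhos)
        cs = np.minimum(c_bar, rhos)
        values_tp1 = np.append(values[1:], bootstrap_value)
        deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)
        vs = np.zeros(T)
        acc = 0.0
        # Backward recursion:
        # v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
        for t in reversed(range(T)):
            acc = deltas[t] + gamma * cs[t] * acc
            vs[t] = values[t] + acc
        return vs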
Mix & Match - Agent Curricula for Reinforcement Learning
Wojciech Czarnecki · Siddhant Jayakumar · Max Jaderberg · Leonard Hasenclever · Yee Teh · Nicolas Heess · Simon Osindero · Razvan Pascanu
We introduce Mix & Match (M&M), a training framework designed to facilitate rapid and effective learning in RL agents that would be too slow or too challenging to train otherwise. The key innovation is a procedure that automatically forms a curriculum over agents. Through such a curriculum we can progressively train more complex agents by, effectively, bootstrapping from solutions found by simpler agents. In contrast to typical curriculum learning approaches, we do not gradually modify the tasks or environments presented, but instead use a process that gradually alters how the policy is represented internally. We show the broad applicability of our method by demonstrating significant performance gains in three different experimental setups: (1) We train an agent able to control more than 700 actions in a challenging 3D first-person task; using our method to progress through an action-space curriculum, we achieve both faster training and better final performance than one obtains using traditional methods. (2) We further show that M&M can be used successfully to progress through a curriculum of architectural variants defining an agent's internal state. (3) Finally, we illustrate how a variant of our method can be used to improve agent performance in a multitask setting.
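As a rough illustration of bootstrapping a complex agent from a simpler one, the snippet below blends the two agents' action distributions with a mixing weight annealed from 0 to 1 over training. The mixture form and the linear schedule are assumptions made for this sketch, not necessarily the paper's exact procedure.

    import numpy as np

    def mixture_policy(probs_simple, probs_complex, alpha):
        # With weight (1 - alpha) the agent acts like the simple policy,
        # with weight alpha like the complex one; alpha is annealed 0 -> 1
        # so the complex agent gradually takes over.
        return (1.0 - alpha) * probs_simple + alpha * probs_complex

    def linear_alpha(step, anneal_steps):
        # Hypothetical linear annealing schedule, for illustration only.
        return min(1.0, step / float(anneal_steps))

    # Example: sample an action from the blended policy at training step 5000.
    p = mixture_policy(np.array([0.7, 0.2, 0.1]),
                       np.array([0.1, 0.1, 0.8]),
                       linear_alpha(5000, 20000))
    action = np.random.choice(len(p), p=p)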
Learning to Explore via Meta-Policy Gradient
Tianbing Xu · Qiang Liu · Liang Zhao · Jian Peng
The performance of off-policy learning, including deep Q-learning and deep deterministic policy gradient (DDPG), critically depends on the choice of the exploration policy. Existing exploration methods are mostly based on adding noise to the ongoing actor policy and can only explore local regions close to what the actor policy dictates. In this work, we develop a simple meta-policy gradient algorithm that allows us to adaptively learn the exploration policy in DDPG. Our algorithm allows us to train flexible exploration behaviors that are independent of the actor policy, yielding a global exploration that significantly speeds up the learning process. With an extensive study, we show that our method significantly improves the sample-efficiency of DDPG on a variety of reinforcement learning continuous control tasks.
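The core quantity described here is a meta-reward for the exploration policy: how much the DDPG actor improves after training on the data the exploration policy collected. The NumPy sketch below computes a REINFORCE-style meta-gradient for a linear-Gaussian exploration policy; the policy parameterization and names are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def meta_policy_gradient(states, actions, mu_weights, sigma, meta_reward):
        """REINFORCE-style meta-gradient for a linear-Gaussian exploration
        policy pi_e(a|s) = N(W s, sigma^2 I) (an illustrative assumption).
        meta_reward is the scalar improvement of the DDPG actor measured
        before vs. after training on the exploration data, so exploration
        that helps the learner gets reinforced.
        states: [T, ds], actions: [T, da], mu_weights W: [da, ds]."""
        means = states @ mu_weights.T                          # [T, da]
        # grad of sum_t log N(a_t | W s_t, sigma^2 I) w.r.t. W
        grad_logp = ((actions - means) / sigma**2).T @ states  # [da, ds]
        return meta_reward * grad_logp

    # Usage sketch:
    # W += lr * meta_policy_gradient(S, A, W, sigma, perf_after - perf_before)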