

Session

Deep RL 2


Tue 11 June 14:00 - 14:20 PDT

An Investigation of Model-Free Planning

Arthur Guez · Mehdi Mirza · Karol Gregor · Rishabh Kabra · Sebastien Racaniere · Theophane Weber · David Raposo · Adam Santoro · Laurent Orseau · Tom Eccles · Greg Wayne · David Silver · Timothy Lillicrap

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learns how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree-structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.
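
To make the setup concrete, here is a minimal sketch of an agent built only from standard components, a convolutional encoder plus an LSTM core, that can spend extra internal "ticks" per environment step to use additional thinking time. The class, layer sizes, and tick mechanism are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative sketch only: a model-free agent built from standard components
# (conv encoder + LSTM core) that can spend extra internal "ticks" per step.
# Layer sizes and the tick loop are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class RecurrentConvAgent(nn.Module):
    def __init__(self, in_channels=3, hidden=128, n_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.core = nn.LSTMCell(64 * 4 * 4, hidden)  # assumes 16x16 observations
        self.policy = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)

    def forward(self, obs, state, ticks=1):
        # More ticks = more internal computation ("thinking time") per env step.
        x = self.encoder(obs)
        h, c = state
        for _ in range(ticks):
            h, c = self.core(x, (h, c))
        return self.policy(h), self.value(h), (h, c)

agent = RecurrentConvAgent()
obs = torch.zeros(1, 3, 16, 16)
state = (torch.zeros(1, 128), torch.zeros(1, 128))
logits, value, state = agent(obs, state, ticks=3)  # extra thinking before acting
```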

Tue 11 June 14:20 - 14:25 PDT

CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning

Cédric Colas · Pierre-Yves Oudeyer · Olivier Sigaud · Pierre Fournier · Mohamed Chetouani

In open-ended and changing environments, agents face a wide range of potential tasks that might not come with associated reward functions. Such autonomous learning agents must set their own tasks and build their own curriculum through intrinsically motivated exploration. Because some tasks might prove easy and some impossible, agents must actively select which task to practice at any given moment to maximize their overall mastery of the set of learnable tasks. This paper proposes CURIOUS, an algorithm that leverages: 1) an extension of Universal Value Function Approximators to achieve, within a single policy, multiple tasks, each parameterized by multiple goals, and 2) an automated curriculum learning mechanism that biases the attention of the agent towards tasks maximizing the absolute learning progress. Agents focus on achievable tasks first, and focus back on tasks that are being forgotten. Experiments conducted in a new multi-task, multi-goal robotic environment show that our algorithm benefits from these two ideas and demonstrates robustness to distracting tasks, forgetting, and changes in body properties.
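
As a rough illustration of the curriculum mechanism, the sketch below samples tasks with probability proportional to their absolute learning progress, estimated here as the gap between a recent and an older success-rate window. The estimator and the uniform-mixing term are assumptions, not necessarily CURIOUS's exact rule.

```python
# Illustrative sketch: bias task selection toward high absolute learning
# progress (LP). The LP estimator and epsilon-mixing are assumptions.
import numpy as np

def task_probabilities(recent_success, older_success, eps=0.1):
    """Sample tasks proportionally to |recent - older| success rate."""
    lp = np.abs(np.asarray(recent_success) - np.asarray(older_success))
    if lp.sum() == 0:
        return np.full(len(lp), 1.0 / len(lp))
    # Mix an LP-proportional term with a uniform term so no task is starved.
    return (1 - eps) * lp / lp.sum() + eps / len(lp)

# Task 1 is improving fast, task 0 is nearly mastered, task 2 shows no progress.
probs = task_probabilities(recent_success=[0.9, 0.4, 0.0],
                           older_success=[0.8, 0.1, 0.0])
task = np.random.choice(len(probs), p=probs)
```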

Tue 11 June 14:25 - 14:30 PDT

Task-Agnostic Dynamics Priors for Deep Reinforcement Learning

Yilun Du · Karthik Narasimhan

While model-based deep reinforcement learning (RL) holds great promise for sample efficiency and generalization, learning an accurate dynamics model is often challenging and requires substantial interaction with the environment. A wide variety of domains have dynamics that share common foundations, like the laws of physics, which are rarely exploited by existing algorithms. In fact, humans continuously acquire and use such dynamics priors to easily adapt to operating in new environments. In this work, we propose an approach to learn task-agnostic dynamics priors from videos and incorporate them into an RL agent. Our method involves pre-training a frame predictor on generic, task-agnostic physics videos to initialize dynamics models (and fine-tune them) for unseen target environments. Our frame prediction architecture, SpatialNet, is designed specifically to capture localized physical phenomena and interactions. Our approach allows for both faster policy learning and convergence to better policies, outperforming competitive approaches on several different domains. We also demonstrate that incorporating this prior allows for more effective transfer learning between environments.
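
The pre-train-then-fine-tune recipe can be made concrete with a small sketch. The placeholder convolutional predictor below stands in for SpatialNet (whose actual architecture is not reproduced here), and the data, loss, and learning rates are assumptions.

```python
# Illustrative sketch: pre-train a next-frame predictor on task-agnostic
# physics videos, then fine-tune it on target-environment frames. The simple
# conv net is a placeholder for SpatialNet; hyperparameters are assumptions.
import torch
import torch.nn as nn

frame_predictor = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(frame_predictor.parameters(), lr=1e-4)

def train_step(frames_t, frames_tp1):
    """One gradient step on next-frame prediction (MSE in pixel space)."""
    loss = nn.functional.mse_loss(frame_predictor(frames_t), frames_tp1)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# 1) Pre-training phase: call train_step on generic physics-video frame pairs.
loss = train_step(torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32))

# 2) Fine-tuning phase: lower the learning rate and continue on frames from
#    the unseen target environment before (or while) training the policy.
for group in opt.param_groups:
    group["lr"] = 1e-5
```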

Tue 11 June 14:30 - 14:35 PDT

Diagnosing Bottlenecks in Deep Q-learning Algorithms

Justin Fu · Aviral Kumar · Matthew Soh · Sergey Levine

Q-learning methods represent a commonly used class of algorithms in reinforcement learning: they are generally efficient and simple, and can be combined readily with function approximators for deep reinforcement learning. However, the behavior of Q-learning methods with function approximation is poorly understood, both theoretically and empirically. In this work, we aim to experimentally investigate potential issues in Q-learning, by means of a "unit testing" framework where we can utilize oracles to disentangle sources of error. Specifically, we investigate questions related to convergence, function approximation, sampling error and nonstationarity, and where available, verify whether trends found in oracle settings hold true with modern deep RL methods. We find that large neural network architectures have many benefits with regard to learning stability; we offer several practical ways to compensate for overfitting; and we develop a novel sampling method based on explicitly compensating for function approximation error that yields significant improvement on high-dimensional continuous control domains.
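
The "unit testing with oracles" idea can be illustrated on a toy tabular MDP, where an exact Bellman backup using the true model can be swapped for a sampled backup to isolate sampling error. The random MDP and backups below are an assumption made in the spirit of the methodology, not the paper's actual setup.

```python
# Toy illustration of the oracle "unit testing" idea (not the paper's setup):
# on a small random MDP, compare an exact Bellman backup against a sampled one
# to isolate the contribution of sampling error.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # true transition probabilities
R = rng.random((S, A))                      # true rewards

def exact_backup(Q):
    """Oracle Bellman optimality backup using the true model."""
    return R + gamma * P @ Q.max(axis=1)

def sampled_backup(Q, n_samples=1):
    """Backup estimated from sampled next states (adds sampling error)."""
    target = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            ns = rng.choice(S, size=n_samples, p=P[s, a])
            target[s, a] = R[s, a] + gamma * Q[ns].max(axis=1).mean()
    return target

Q = np.zeros((S, A))
for _ in range(200):
    Q = exact_backup(Q)   # swap in sampled_backup(Q) to study the error gap
```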

Tue 11 June 14:35 - 14:40 PDT

Collaborative Evolutionary Reinforcement Learning

Shauharda Khadka · Somdeb Majumdar · Tarek Nassar · Zach Dwiel · Evren Tumer · Santiago Miret · Yinyin Liu · Kagan Tumer

Deep reinforcement learning algorithms have been successfully applied to a range of challenging control tasks. However, these methods typically struggle to achieve effective exploration and are extremely sensitive to the choice of hyperparameters. One reason is that most approaches use a noisy version of their operating policy to explore, thereby limiting the range of exploration. In this paper, we introduce Collaborative Evolutionary Reinforcement Learning (CERL), a scalable framework that comprises a portfolio of policies that simultaneously explore and exploit diverse regions of the solution space. A collection of learners, typically proven algorithms like TD3, optimize over varying time horizons, leading to this diverse portfolio. All learners contribute to and use a shared replay buffer to achieve greater sample efficiency. Computational resources are dynamically distributed to favor the best learners as a form of online algorithm selection. Neuroevolution binds this entire process to generate a single emergent learner that exceeds the capabilities of any individual learner. Experiments on a range of continuous control benchmarks demonstrate that the emergent learner significantly outperforms its composite learners while remaining more sample-efficient overall, notably solving the MuJoCo Humanoid benchmark where all of its composite learners (TD3) fail entirely in isolation.
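
One ingredient, steering computational resources toward the best-performing learners, can be sketched as a bandit-style rule. The UCB scoring below and the statistics tracked are assumptions, not necessarily CERL's exact allocation scheme.

```python
# Illustrative sketch: allocate rollout workers among a portfolio of learners
# with a UCB-style rule. The scoring rule and tracked statistics are
# assumptions, not necessarily CERL's exact resource-allocation scheme.
import math
import random

class LearnerStats:
    def __init__(self):
        self.pulls, self.mean_return = 0, 0.0

    def update(self, episode_return):
        self.pulls += 1
        self.mean_return += (episode_return - self.mean_return) / self.pulls

def allocate_worker(stats, total_pulls, c=1.0):
    """Pick which learner's policy the next rollout worker should run."""
    def ucb(s):
        if s.pulls == 0:
            return float("inf")  # try every learner at least once
        return s.mean_return + c * math.sqrt(math.log(total_pulls) / s.pulls)
    return max(range(len(stats)), key=lambda i: ucb(stats[i]))

learners = [LearnerStats() for _ in range(4)]  # e.g. TD3 over different horizons
for t in range(1, 101):
    i = allocate_worker(learners, t)
    learners[i].update(episode_return=random.random())  # placeholder returns
```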

Tue 11 June 14:40 - 15:00 PDT

EMI: Exploration with Mutual Information

Hyoungseok Kim · Jaekyeom Kim · Yeonwoo Jeong · Sergey Levine · Hyun Oh Song

Reinforcement learning algorithms struggle when the reward signal is very sparse. In these cases, naive random exploration methods essentially rely on a random walk to stumble onto a rewarding state. Recent works utilize intrinsic motivation to guide exploration via generative models, predictive forward models, or discriminative modeling of novelty. We propose EMI, an exploration method that constructs an embedding representation of states and actions without relying on generative decoding of the full observation; instead, it extracts predictive signals that can be used to guide exploration based on forward prediction in the representation space. Our experiments show that the proposed method significantly outperforms a number of existing exploration methods on challenging locomotion tasks with continuous control and on image-based exploration tasks with discrete actions on Atari.
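
The core idea, forward prediction in a learned representation space rather than in pixel space, can be sketched as follows. The encoders, the linear forward model, and the squared-error bonus are assumptions, not EMI's exact mutual-information objective.

```python
# Illustrative sketch (not EMI's exact objective): embed states and actions,
# learn a forward model in the embedding space, and use its prediction error
# there as an intrinsic exploration bonus.
import torch
import torch.nn as nn

state_enc = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 16))
action_enc = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 16))
forward_model = nn.Linear(32, 16)  # predicts the next-state embedding

def intrinsic_reward(s, a, s_next):
    z, u, z_next = state_enc(s), action_enc(a), state_enc(s_next)
    pred = forward_model(torch.cat([z, u], dim=-1))
    # Large error in the learned representation space marks a "novel" transition.
    return (pred - z_next).pow(2).sum(dim=-1)

r_int = intrinsic_reward(torch.randn(5, 8), torch.randn(5, 2), torch.randn(5, 8))
```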

Tue 11 June 15:00 - 15:05 PDT

Imitation Learning from Imperfect Demonstration

Yueh-Hua Wu · Nontawat Charoenphakdee · Han Bao · Voot Tangkaratt · Masashi Sugiyama

Imitation learning (IL) aims to learn an optimal policy from demonstrations. However, such demonstrations are often imperfect, since collecting optimal ones is costly. To effectively learn from imperfect demonstrations, we propose a novel approach that utilizes confidence scores, which describe the quality of demonstrations. More specifically, we propose two confidence-based IL methods, namely two-step importance weighting IL (2IWIL) and generative adversarial IL with imperfect demonstration and confidence (IC-GAIL). We show, both theoretically and empirically, that confidence scores given to only a small portion of sub-optimal demonstrations significantly improve the performance of IL.
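
To illustrate how confidence scores can enter a training objective, the sketch below weights a simple behavioral-cloning loss by per-demonstration confidence. The paper's 2IWIL and IC-GAIL objectives are adversarial and more involved, so this is only an assumption-laden analogy.

```python
# Illustrative analogy only: weight a behavioral-cloning loss by per-sample
# confidence scores in [0, 1]. The paper's 2IWIL/IC-GAIL objectives are
# adversarial and more involved than this sketch.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def weighted_bc_step(states, actions, confidences):
    """Down-weight demonstrations that are likely sub-optimal."""
    per_sample = nn.functional.cross_entropy(policy(states), actions,
                                             reduction="none")
    loss = (confidences * per_sample).sum() / confidences.sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = weighted_bc_step(torch.randn(32, 4),          # demo states
                        torch.randint(0, 3, (32,)),  # demo actions
                        torch.rand(32))              # confidence scores
```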

Tue 11 June 15:05 - 15:10 PDT

Curiosity-Bottleneck: Exploration By Distilling Task-Specific Novelty

Youngjin Kim · Daniel Nam · Hyunwoo Kim · Ji-Hoon Kim · Gunhee Kim

Exploration based on state novelty has brought great success in challenging reinforcement learning problems with sparse rewards. However, existing novelty-based strategies become inefficient in real-world problems where the observation contains not only the task-dependent state novelty of interest but also task-irrelevant information that should be ignored. We introduce an information-theoretic exploration strategy named Curiosity-Bottleneck that distills task-relevant information from observations. Based on the Information Bottleneck principle, our exploration bonus is quantified as the compressiveness of the observation with respect to the learned representation of a compressive value network. With extensive experiments on static image classification, a grid world, and three hard-exploration Atari games, we show that Curiosity-Bottleneck learns an effective exploration strategy by robustly measuring state novelty in distractive environments where state-of-the-art exploration methods often degenerate.
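
The compressiveness bonus can be sketched with a Gaussian encoder and a standard-normal prior, where hard-to-compress observations receive a larger KL bonus. The encoder and prior below are assumptions, and Curiosity-Bottleneck's value-network training adds machinery not shown here.

```python
# Illustrative sketch: an information-bottleneck style bonus. Observations
# whose encoding q(z|x) is far from a standard-normal prior (high KL) are
# treated as hard to compress and get a larger exploration bonus. The encoder
# and prior are assumptions; the paper's value-network training is omitted.
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, obs_dim=16, z_dim=8):
        super().__init__()
        self.net = nn.Linear(obs_dim, 2 * z_dim)

    def forward(self, x):
        mu, log_var = self.net(x).chunk(2, dim=-1)
        return mu, log_var

def kl_bonus(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), one value per observation."""
    return 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(dim=-1)

encoder = GaussianEncoder()
bonus = kl_bonus(*encoder(torch.randn(4, 16)))  # exploration bonus per obs
```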

Tue 11 June 15:10 - 15:15 PDT

Dynamic Weights in Multi-Objective Deep Reinforcement Learning

Axel Abels · Diederik Roijers · Tom Lenaerts · Ann Nowé · Denis Steckelmacher

Many real-world decision problems are characterized by multiple conflicting objectives which must be balanced based on their relative importance. In the dynamic weights setting, the relative importance changes over time, and specialized algorithms that deal with such change, such as the tabular Reinforcement Learning (RL) algorithm by Natarajan and Tadepalli (2005), are required. However, this earlier work is not feasible for RL settings that necessitate the use of function approximators. We generalize across weight changes and high-dimensional inputs by proposing a multi-objective Q-network whose outputs are conditioned on the relative importance of objectives, and we introduce Diverse Experience Replay (DER) to counter the inherent non-stationarity of the dynamic weights setting. We perform an extensive experimental evaluation, compare our methods to adapted algorithms from deep multi-task/multi-objective reinforcement learning, and show that our proposed network in combination with DER dominates these adapted algorithms across weight-change scenarios and problem domains.
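
The weight-conditioned Q-network can be sketched directly: the network receives the current objective weights alongside the state and outputs one Q-value per (action, objective) pair, which is then scalarized with those weights. Layer sizes and the concatenation scheme are assumptions, not the paper's exact design.

```python
# Illustrative sketch: a Q-network conditioned on the current objective
# weights, outputting one Q-value per (action, objective) pair. Sizes and the
# concatenation scheme are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class ConditionedMOQNetwork(nn.Module):
    def __init__(self, obs_dim=8, n_objectives=2, n_actions=4, hidden=64):
        super().__init__()
        self.n_actions, self.n_objectives = n_actions, n_objectives
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_objectives),
        )

    def forward(self, obs, weights):
        q = self.net(torch.cat([obs, weights], dim=-1))
        return q.view(-1, self.n_actions, self.n_objectives)

net = ConditionedMOQNetwork()
obs, w = torch.randn(1, 8), torch.tensor([[0.7, 0.3]])
q_values = net(obs, w)                              # (batch, actions, objectives)
scalarized = (q_values * w.unsqueeze(1)).sum(-1)    # act greedily w.r.t. this
```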

Tue 11 June 15:15 - 15:20 PDT

Fingerprint Policy Optimisation for Robust Reinforcement Learning

Supratik Paul · Michael A Osborne · Shimon Whiteson

Policy gradient methods ignore the potential value of adjusting environment variables: unobservable state features that are randomly determined by the environment in a physical setting, but are controllable in a simulator. This can lead to slow learning, or convergence to suboptimal policies, if the environment variable has a large impact on the transition dynamics. In this paper, we present fingerprint policy optimisation (FPO), which finds a policy that is optimal in expectation across the distribution of environment variables. The central idea is to use Bayesian optimisation (BO) to actively select the distribution of the environment variable that maximises the improvement generated by each iteration of the policy gradient method. To make this BO practical, we contribute two easy-to-compute low-dimensional fingerprints of the current policy. Our experiments show that FPO can efficiently learn policies that are robust to significant rare events, which are unlikely to be observable under random sampling, but are key to learning good policies.
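
As one small, concrete piece of the picture, a policy fingerprint can be any cheap low-dimensional summary of the current policy. The probe-state statistics below are an illustrative stand-in, since the paper defines its own fingerprints, and the Bayesian optimisation loop that consumes them is not shown.

```python
# Illustrative stand-in for a policy "fingerprint": summarize the current
# policy by the mean and standard deviation of its actions on a fixed set of
# probe states. The paper defines its own fingerprints; the BO loop that
# consumes them is not shown here.
import numpy as np

probe_states = np.random.default_rng(0).normal(size=(32, 4))  # fixed once

def policy_fingerprint(policy_fn):
    """Return a small vector summarizing the policy for the BO surrogate."""
    actions = np.stack([policy_fn(s) for s in probe_states])
    return np.concatenate([actions.mean(axis=0), actions.std(axis=0)])

fingerprint = policy_fingerprint(lambda s: np.tanh(s[:2]))  # toy 2-D policy
```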