Track: Deep RL

Wed 12 June 11:00 - 11:20 PDT

Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning

Natasha Jaques · Angeliki Lazaridou · Edward Hughes · Caglar Gulcehre · Pedro Ortega · DJ Strouse · Joel Z Leibo · Nando de Freitas

We propose a unified mechanism for achieving coordination and communication in Multi-Agent Reinforcement Learning (MARL), through rewarding agents for having causal influence over other agents' actions. Causal influence is assessed using counterfactual reasoning. At each timestep, an agent simulates alternate actions that it could have taken, and computes their effect on the behavior of other agents. Actions that lead to larger changes in other agents' behavior are considered influential and are rewarded. We show that this is equivalent to rewarding agents for having high mutual information between their actions. Empirical results demonstrate that influence leads to enhanced coordination and communication in challenging social dilemma environments, dramatically improving the learning curves of the deep RL agents, and leading to more meaningful learned communication protocols. The influence rewards for all agents can be computed in a decentralized way by enabling agents to learn a model of other agents using deep neural networks. In contrast, key previous works on emergent communication in the MARL setting were unable to learn diverse policies in a decentralized manner and had to resort to centralized training. Consequently, the influence reward opens up a window of new opportunities for research in this area.
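The counterfactual computation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes discrete actions, a single influenced agent, and a hypothetical table `cond_policy` giving that agent's action distribution under each counterfactual action of the influencer. The reward is taken to be the KL divergence between the conditional distribution under the action actually chosen and the marginal over counterfactual actions; averaged over the influencer's actions, this quantity is the mutual information between the two agents' actions.

```python
import math

def influence_reward(cond_policy, prior_k, actual_k):
    """KL( p(a_j | a_k = actual_k, s) || p(a_j | s) ) for one influenced agent j.

    cond_policy[k][j]: p(a_j = j | a_k = k, s), a hypothetical model output.
    prior_k[k]:        p(a_k = k | s), the influencer's own policy.
    actual_k:          index of the action the influencer actually took.
    """
    n_j = len(cond_policy[0])
    # Marginalize agent j's policy over the influencer's counterfactual actions.
    marginal = [
        sum(prior_k[k] * cond_policy[k][j] for k in range(len(prior_k)))
        for j in range(n_j)
    ]
    p = cond_policy[actual_k]
    # Terms with p[j] == 0 contribute nothing to the KL divergence.
    return sum(p[j] * math.log(p[j] / marginal[j]) for j in range(n_j) if p[j] > 0)

# Agent j's behavior depends strongly on agent k's action: positive reward.
responsive = influence_reward([[0.9, 0.1], [0.1, 0.9]], [0.5, 0.5], actual_k=0)
# Agent j ignores agent k entirely: zero reward.
indifferent = influence_reward([[0.7, 0.3], [0.7, 0.3]], [0.5, 0.5], actual_k=0)
```

In the decentralized setting the abstract mentions, `cond_policy` would come from each agent's learned model of the other agents rather than from access to their true policies.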

Wed 12 June 11:20 - 11:25 PDT

Maximum Entropy-Regularized Multi-Goal Reinforcement Learning

Rui Zhao · Xudong Sun · Volker Tresp

In Multi-Goal Reinforcement Learning, an agent learns to achieve multiple goals with a goal-conditioned policy. During learning, the agent first collects trajectories into a replay buffer, and these trajectories are later selected randomly for replay. However, the achieved goals in the replay buffer are often biased towards the behavior policies. From a Bayesian perspective, when there is no prior knowledge of the target goal distribution, the agent should learn uniformly from diverse achieved goals. Therefore, we first propose a novel multi-goal RL objective based on weighted entropy. This objective encourages the agent to maximize the expected return, as well as to achieve more diverse goals. Second, we develop a maximum entropy-based prioritization framework to optimize the proposed objective. To evaluate this framework, we combine it with Deep Deterministic Policy Gradient, both with and without Hindsight Experience Replay. On a set of multi-goal robotic tasks in OpenAI Gym, we compare our method with other baselines and show promising improvements in both performance and sample-efficiency.

Wed 12 June 11:25 - 11:30 PDT

Imitating Latent Policies from Observation

Ashley Edwards · Himanshu Sahni · Yannick Schroecker · Charles Isbell

In this paper, we describe a novel approach to imitation learning that infers latent policies directly from state observations. We introduce a method that characterizes the causal effects of latent actions on observations while simultaneously predicting their likelihood. We then outline an action alignment procedure that leverages a small number of environment interactions to determine a mapping between the latent and real-world actions. We show that this corrected labeling can be used for imitating the observed behavior, even though no expert actions are given. We evaluate our approach within classic control environments and a platform game and demonstrate that it performs better than standard approaches. Code and videos for this work are available in the supplementary material.

Wed 12 June 11:30 - 11:35 PDT

SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning

Marvin Zhang · Sharad Vikram · Laura Smith · Pieter Abbeel · Matthew Johnson · Sergey Levine

Model-based reinforcement learning (RL) has proven to be a data-efficient approach for learning control tasks but is difficult to utilize in domains with complex observations such as images. In this paper, we present a method for learning representations that are suitable for iterative model-based policy improvement, in that these representations are optimized for inferring simple dynamics and cost models given the data from the current policy. This enables a model-based RL method based on the linear-quadratic regulator (LQR) to be used for systems with image observations. We evaluate our approach on a suite of robotics tasks, including manipulation tasks on a real Sawyer robot arm directly from images, and we find that our method results in better final performance than other model-based RL methods while being significantly more efficient than model-free RL.

Wed 12 June 11:35 - 11:40 PDT

Dimension-Wise Importance Sampling Weight Clipping for Sample-Efficient Reinforcement Learning

Seungyul Han · Youngchul Sung

In importance sampling (IS)-based reinforcement learning algorithms such as Proximal Policy Optimization (PPO), IS weights are typically clipped to avoid large variance in learning. However, policy updates from clipped statistics induce large bias in tasks with high action dimensions, and bias from clipping makes it difficult to reuse old samples with large IS weights. In this paper, we consider PPO, a representative on-policy algorithm, and propose an improvement based on dimension-wise IS weight clipping, which separately clips the IS weight of each action dimension to avoid large bias and adaptively controls the IS weight to bound the policy update from the current policy. This new technique enables efficient learning for high action-dimensional tasks and the reuse of old samples, as in off-policy learning, to increase sample efficiency. Numerical results show that the proposed algorithm outperforms PPO and other RL algorithms in various OpenAI Gym tasks.
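To make the idea of clipping each action dimension separately concrete, here is a small sketch. It assumes a factorized Gaussian policy with a shared standard deviation, so that the joint IS weight is a product of per-dimension ratios, and it applies a PPO-style clipped term to each ratio before averaging. The function names and the averaging are illustrative assumptions; the adaptive control of the IS weight mentioned in the abstract is not modeled, and this should not be read as the paper's exact objective.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def dimension_wise_ratios(action, new_mean, old_mean, std):
    """Per-dimension IS weights pi_new(a_d) / pi_old(a_d) for a factorized Gaussian."""
    return [
        math.exp(gaussian_logpdf(a, mn, std) - gaussian_logpdf(a, mo, std))
        for a, mn, mo in zip(action, new_mean, old_mean)
    ]

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def clipped_surrogate_per_dim(ratios, advantage, eps=0.2):
    """PPO-style pessimistic clipped term, applied to each dimension and averaged."""
    terms = [
        min(r * advantage, clip(r, 1.0 - eps, 1.0 + eps) * advantage)
        for r in ratios
    ]
    return sum(terms) / len(terms)

# Only the first action dimension has moved away from the old policy.
ratios = dimension_wise_ratios(
    action=[1.0, 0.0], new_mean=[1.0, 0.0], old_mean=[0.0, 0.0], std=1.0
)
objective = clipped_surrogate_per_dim(ratios, advantage=1.0, eps=0.2)
```

In this example only the first dimension's ratio exceeds the clipping range, so only that dimension's term is clipped, whereas clipping the joint ratio would clip the whole sample's contribution at once.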

Wed 12 June 11:40 - 12:00 PDT

Structured agents for physical construction

Victor Bapst · Alvaro Sanchez-Gonzalez · Carl Doersch · Kimberly Stachenfeld · Pushmeet Kohli · Peter Battaglia · Jessica Hamrick

Physical construction (the ability to compose objects, subject to physical dynamics, in order to serve some function) is fundamental to human intelligence. Here we introduce a suite of challenging physical construction tasks inspired by how children play with blocks, such as matching a target configuration, stacking and attaching blocks to connect objects together, and creating shelter-like structures over target objects. We then examine how a range of modern deep reinforcement learning agents fare on these challenges, and introduce several new approaches which provide superior performance. Our results show that agents which use structured representations (e.g., objects and scene graphs) and structured policies (e.g., object-centric actions) outperform those which use less structured representations, and generalize better beyond their training. Agents which use model-based planning via Monte-Carlo Tree Search also outperform strictly model-free agents in our most challenging construction problems. We conclude that approaches which combine structured representations and reasoning with powerful learning are a key path toward agents that can perform complex construction behaviors.

Wed 12 June 12:00 - 12:05 PDT

Learning Novel Policies For Tasks

Yunbo Zhang · Wenhao Yu · Greg Turk

Finding multiple distinct solutions for a particular task is a challenging problem for reinforcement learning algorithms. In this work, we present a reinforcement learning algorithm that can find a variety of policies (novel policies) for a task that is given by a task reward function. Our method does this by creating a second reward function that recognizes previously seen state sequences and rewards state sequences according to their novelty. Novelty is measured using autoencoders that have been trained on state sequences from previously discovered policies. We present a two-objective update technique for policy gradient algorithms in which each update of the policy is a compromise between improving the task reward and improving the novelty reward. Using this method, we end up with a collection of policies that solve a given task while carrying out action sequences that are distinct from one another. We demonstrate this method on maze navigation tasks, a reaching task for a simulated robot arm, and a locomotion task for a hopper. We also demonstrate the effectiveness of our approach on deceptive tasks in which policy gradient methods often get stuck.

Wed 12 June 12:05 - 12:10 PDT

Taming MAML: Efficient unbiased meta-reinforcement learning

Hao Liu · Richard Socher · Caiming Xiong

While meta-reinforcement learning (Meta-RL) methods have achieved remarkable success, obtaining correct and low-variance estimates for policy gradients remains a significant challenge. In particular, estimating a large Hessian, poor sample efficiency, and unstable training continue to make Meta-RL difficult. We propose a surrogate objective function named Tamed MAML (TMAML) that adds control variates into gradient estimation via automatic differentiation. TMAML improves the quality of gradient estimation by reducing variance without introducing bias. We further propose a version of our method that extends the meta-learning framework to learning the control variates themselves, enabling efficient learning from a distribution of MDPs. We empirically compare our approach with MAML and other variance-bias trade-off methods including DICE, LVC, and action-dependent control variates. Our approach is easy to implement and outperforms existing methods in terms of the variance and accuracy of gradient estimation, ultimately yielding higher performance across a variety of challenging Meta-RL environments.

Wed 12 June 12:10 - 12:15 PDT

Self-Supervised Exploration via Disagreement

Deepak Pathak · Dhiraj Gandhi · Abhinav Gupta

Exploration has been a long-standing problem in both model-based and model-free learning methods for sensorimotor control. There have been major advances in recent years demonstrated in noise-free, non-stochastic domains such as video games and simulation. However, most of the current formulations get stuck when there are stochastic dynamics. In this paper, we propose a formulation for exploration inspired by work in the active learning literature. Specifically, we train an ensemble of dynamics models and incentivize the agent to maximize the disagreement, or variance, of the ensemble's predictions. We show that this formulation works as well as other formulations in non-stochastic scenarios, and is able to explore better in scenarios with stochastic dynamics. Further, we show that this objective can be leveraged to perform differentiable policy optimization. This leads to a sample-efficient exploration policy. We show experiments on a large number of standard environments to demonstrate the efficacy of this approach. Furthermore, we implement our exploration algorithm on a real robot which learns to interact with objects completely from scratch. Project videos are in the supplementary material.
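The disagreement signal described in this abstract can be illustrated with a short sketch. It assumes each ensemble member has already produced a prediction of the next-state features for the same state and action, and it uses the variance across members, averaged over feature dimensions, as the intrinsic reward. The function name and the averaging over dimensions are illustrative choices, not details taken from the paper.

```python
def disagreement_reward(predictions):
    """Intrinsic reward from ensemble disagreement.

    predictions[m][d]: next-state feature d as predicted by ensemble member m.
    Returns the variance across members, averaged over feature dimensions.
    """
    n_models = len(predictions)
    n_dims = len(predictions[0])
    total_var = 0.0
    for d in range(n_dims):
        column = [predictions[m][d] for m in range(n_models)]
        mean = sum(column) / n_models
        total_var += sum((v - mean) ** 2 for v in column) / n_models
    return total_var / n_dims

# Members agree: nothing left to learn here, so no exploration bonus.
agree = disagreement_reward([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
# Members disagree on the first feature: positive exploration bonus.
disagree = disagreement_reward([[0.0, 0.0], [2.0, 0.0]])
```

One intuition for the stochastic case: if all members are trained on the same noisy transitions, they can converge to similar mean predictions, so their variance shrinks even where the prediction error of any single model would stay high.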

Wed 12 June 12:15 - 12:20 PDT

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

Kate Rakelly · Aurick Zhou · Chelsea Finn · Sergey Levine · Deirdre Quillen

Deep reinforcement learning algorithms require large amounts of experience to learn an individual task. While in principle meta-reinforcement learning (meta-RL) algorithms enable agents to learn new skills from small amounts of experience, several major challenges preclude their practicality. Current methods rely heavily on on-policy experience, limiting their sample efficiency, and lack mechanisms to reason about task uncertainty when identifying and learning new tasks, limiting their effectiveness in sparse reward problems. In this paper, we aim to address these challenges by developing an off-policy meta-RL algorithm based on online latent task inference. Our method can be interpreted as an implementation of online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience. This probabilistic interpretation also enables posterior sampling for structured exploration. Our method outperforms prior algorithms in asymptotic performance and sample efficiency on several meta-RL benchmarks.
