Track: Deep RL 1

Tue 11 June 11:00 - 11:20 PDT

ELF OpenGo: an analysis and open reimplementation of AlphaZero

Yuandong Tian · Jerry Ma · Qucheng Gong · Shubho Sengupta · Zhuoyuan Chen · James Pinkerton · Larry Zitnick

The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training and in the gameplay inference procedures. Our code, models, selfplay datasets, and auxiliary data are publicly available.

Tue 11 June 11:20 - 11:25 PDT

Making Deep Q-learning methods robust to time discretization

Corentin Tallec · Leonard Blier · Yann Ollivier

Despite remarkable successes, Deep Reinforce- ment Learning (DRL) is not robust to hyperparam- eterization, implementation details, or small envi- ronment changes (Henderson et al. 2017, Zhang et al. 2018). Overcoming such sensitivity is key to making DRL applicable to real world problems. In this paper, we identify sensitivity to time dis- cretization in near continuous-time environments as a critical factor; this covers, e.g., changing the number of frames per second, or the action frequency of the controller. Empirically, we find that Q-learning-based approaches such as Deep Q- learning (Mnih et al., 2015) and Deep Determinis- tic Policy Gradient (Lillicrap et al., 2015) collapse with small time steps. Formally, we prove that Q-learning does not exist in continuous time. We detail a principled way to build an off-policy RL algorithm that yields similar performances over a wide range of time discretizations, and confirm this robustness empirically.

Tue 11 June 11:25 - 11:30 PDT

Nonlinear Distributional Gradient Temporal-Difference Learning

chao qu · Shie Mannor · Huan Xu

We devise a distributional variant of gradient temporal-difference (TD) learning. Distributional reinforcement learning has been demonstrated to outperform the regular one in the recent study \citep{bellemare2017distributional}. In the policy evaluation setting, we design two new algorithms called distributional GTD2 and distributional TDC using the Cram{\'e}r distance on the distributional version of the Bellman error objective function, which inherits advantages of both the nonlinear gradient TD algorithms and the distributional RL approach. In the control setting, we propose the distributional Greedy-GQ using similar derivation. We prove the asymptotic almost-sure convergence of distributional GTD2 and TDC to a local optimal solution for general smooth function approximators, which includes neural networks that have been widely used in recent study to solve the real-life RL problems. In each step, the computational complexity of above three algorithms is linear w.r.t.\ the number of the parameters of the function approximator, thus can be implemented efficiently for neural networks.

Tue 11 June 11:30 - 11:35 PDT

Composing Entropic Policies using Divergence Correction

Jonathan Hunt · Andre Barreto · Timothy Lillicrap · Nicolas Heess

Deep reinforcement learning algorithms have achieved remarkable successes, but often require vast amounts of experience to solve a task. Composing skills mastered in one task in order to efficiently solve novel challenges promises dramatic improvements in data efficiency. Here, we build on two recent works composing behaviors represented in the form of action-value functions. We analyze prior methods and show that they perform poorly in some situations. As part of this analysis, we extend an important generalization of policy improvement to the maximum entropy framework and introduce an algorithm for the practical implementation of successor features in continuous action spaces. Then we propose a novel approach which addresses the failure cases of prior work and, in principle, recovers the optimal policy during transfer. This method works by explicitly learning the (discounted, future) divergence between base policies. We study this approach in the tabular case and on non-trivial continuous control problems with compositional structure and show that it outperforms or matches existing methods across all tasks considered.

Tue 11 June 11:35 - 11:40 PDT

TibGM: A Transferable and Information-Based Graphical Model Approach for Reinforcement Learning

Tameem Adel · Adrian Weller

Hierarchical reinforcement learning (HRL) can provide a principled solution to the RL challenge of scalability for complex tasks. By incorporating a graphical model (GM) and the rich family of related methods, there is also hope to address issues such as transferability, generalisation and exploration. Here we propose a flexible GM-based HRL framework which leverages efficient inference procedures to enhance generalisation and transfer power. In our proposed transferable and information-based graphical model framework ‘TibGM’, we show the equivalence between our mutual information-based objective in the GM, and an RL consolidated objective consisting of a standard reward maximisation target and a generalisation/transfer objective. In settings where there is a sparse or deceptive reward signal, our TibGM framework is flexible enough to incorporate exploration bonuses depicting intrinsic rewards. We empirically verify improved performance and exploration power.

Tue 11 June 11:40 - 12:00 PDT

Multi-Agent Adversarial Inverse Reinforcement Learning

Lantao Yu · Jiaming Song · Stefano Ermon

Reinforcement learning agents are prone to undesired behaviors due to reward mis-specification. Finding a set of reward functions to properly guide agent behaviors is particularly challenging in multi-agent scenarios. Inverse reinforcement learning provides a framework to automatically acquire suitable reward functions from expert demonstrations. Its extension to multi-agent settings, however, is difficult due to the more complex notions of rational behaviors. In this paper, we propose MA-AIRL, a new framework for multi-agent inverse reinforcement learning, which is effective and scalable for Markov games with high-dimensional state-action space and unknown dynamics. We derive our algorithm based on a new solution concept and maximum pseudolikelihood estimation within an adversarial reward learning framework. In the experiments, we demonstrate that MA-AIRL can recover reward functions that are highly correlated with the ground truth rewards, while significantly outperforms prior methods in terms of policy imitation.

Tue 11 June 12:00 - 12:05 PDT

Policy Consolidation for Continual Reinforcement Learning

Christos Kaplanis · Murray Shanahan · Claudia Clopath

We propose a method for tackling catastrophic forgetting in deep reinforcement learning that is \textit{agnostic} to the timescale of changes in the distribution of experiences, does not require knowledge of task boundaries and can adapt in \textit{continuously} changing environments. In our \textit{policy consolidation} model, the policy network interacts with a cascade of hidden networks that simultaneously remember the agent's policy at a range of timescales and regularise the current policy by its own history, thereby improving its ability to learn without forgetting. We find that the model improves continual learning relative to baselines on a number of continuous control tasks in single-task, alternating two-task, and multi-agent competitive self-play settings.

Tue 11 June 12:05 - 12:10 PDT

Off-Policy Deep Reinforcement Learning without Exploration

Scott Fujimoto · David Meger · Doina Precup

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning with data uncorrelated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.

Tue 11 June 12:10 - 12:15 PDT

Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation

Ruohan Wang · Carlo Ciliberto · Pierluigi Vito Amadori · Yiannis Demiris

We consider the problem of imitation learning from a finite set of expert trajectories, without access to reinforcement signals. The classical approach of extracting the expert's reward function via inverse reinforcement learning, followed by reinforcement learning is indirect and may be computationally expensive. Recent methods based on generative adversarial networks or generative moment matching formulate the task as distribution matching between the expert policy and the learned policy. However, training via distribution matching could be unstable. We propose a new framework for imitation learning based on estimating the support of the expert policy to compute a fixed reward function from the expert trajectories. This allows us to re-frame imitation learning within the standard reinforcement learning setting. We demonstrate the efficacy of our reward function on both discrete and continuous domains. The policies learned using different reinforcement learning methods with the proposed reward function achieve comparable or better performance than other imitation learning methods.

Tue 11 June 12:15 - 12:20 PDT

Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

Zhao Song · Ron Parr · Lawrence Carin

The impact of softmax on the value function itself in reinforcement learning (RL) is often viewed as problematic because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator. Surprisingly, despite these concerns, and {\em independent of its effect on exploration}, the softmax Bellman operator when combined with Deep Q-learning, leads to Q-functions with superior policies in practice, even outperforming its double Q-learning counterpart. To better understand how and why this occurs, we revisit theoretical properties of the softmax Bellman operator, and prove that $(i)$ it converges to the standard Bellman operator exponentially fast in the inverse temperature parameter, and $(ii)$ the distance of its Q function from the optimal one can be bounded. These alone do not explain its superior performance, so we also show that the softmax operator can reduce the overestimation error, which may give some insight into why a sub-optimal operator leads to better performance in the presence of value function approximation. A comparison among different Bellman operators is then presented, showing the trade-offs when selecting them.

Main Navigation

Session

Deep RL 1