Session
Deep RL
The Natural Language of Actions
Guy Tennenholtz · Shie Mannor
We introduce Act2Vec, a general framework for learning context-based action representations for Reinforcement Learning. Representing actions in a vector space helps reinforcement learning algorithms achieve better performance by grouping similar actions and utilizing relations between different actions. We show how prior knowledge of an environment can be extracted from demonstrations and injected into action vector representations that encode natural compatible behavior. We then use these representations to augment state representations as well as to improve function approximation of Q-values. We visualize and test action embeddings in three domains, including a drawing task, a high-dimensional navigation task, and the large action space domain of StarCraft II.
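Since the abstract describes the representation as context-based in the word2vec sense, one way to picture it is a skip-gram model run over demonstrated action sequences. The sketch below uses gensim (assuming version ≥ 4) and made-up action tokens; it illustrates the embedding idea only and is not the authors' implementation.

```python
# A minimal sketch of context-based action embeddings: treat demonstrated
# action sequences like sentences and embed actions with skip-gram.
# The action tokens and all hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec

# Hypothetical demonstrations: each is a sequence of discrete action tokens.
demos = [
    ["move_up", "move_up", "turn_left", "move_up"],
    ["turn_left", "move_up", "turn_right", "attack"],
]

model = Word2Vec(sentences=demos, vector_size=32, window=3, min_count=1, sg=1)

# Actions that co-occur in similar contexts get nearby vectors; such vectors
# could then augment state representations or structure Q-value approximation.
vec = model.wv["move_up"]
similar = model.wv.most_similar("move_up")
```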
Control Regularization for Reduced Variance Reinforcement Learning
Richard Cheng · Abhinav Verma · Gabor Orosz · Swarat Chaudhuri · Yisong Yue · Joel Burdick
Dealing with high variance is a significant challenge in model-free reinforcement learning (RL). Existing methods are unreliable, exhibiting high variance in performance from run to run across different initializations/seeds. Focusing on problems arising in continuous control, we propose a functional regularization approach to augmenting model-free RL. In particular, we regularize the behavior of the deep policy to be similar to a control prior, i.e., we regularize in function space. We show that functional regularization yields a bias-variance trade-off, and propose an adaptive tuning strategy to optimize this trade-off. When the prior policy has control-theoretic stability guarantees, we further show that this regularization approximately preserves those stability guarantees throughout learning. We validate our approach empirically on a wide range of settings, and demonstrate significantly reduced variance, guaranteed dynamic stability, and more efficient learning than deep RL alone.
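In function space, this kind of regularization can be pictured as executing a weighted blend of the learned policy's action and the control prior's action. A minimal sketch, assuming a fixed mixing weight lambda (the paper additionally tunes this weight adaptively):

```python
import numpy as np

def regularized_action(policy_action, prior_action, lam):
    """Blend the learned policy with a control prior in action (function) space.

    lam >= 0 trades bias for variance: lam = 0 recovers the pure deep policy,
    large lam stays close to the (e.g., stability-guaranteed) prior.
    This is a sketch of the functional-regularization idea; the adaptive
    tuning strategy from the paper is omitted.
    """
    lam = float(lam)
    return (policy_action + lam * prior_action) / (1.0 + lam)

# Hypothetical example: a learned action nudged toward an LQR-like prior.
u_rl = np.array([0.8, -0.2])
u_prior = np.array([0.1, 0.0])
u = regularized_action(u_rl, u_prior, lam=4.0)  # mostly follows the prior
```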
On the Generalization Gap in Reparameterizable Reinforcement Learning
Huan Wang · Stephan Zheng · Caiming Xiong · Richard Socher
Understanding generalization in reinforcement learning (RL) is a significant challenge, as many common assumptions of traditional supervised learning theory do not apply. We argue that the gap between training and testing performance of RL agents is caused by two types of errors: intrinsic error, due to the randomness of the environment and an agent's policy, and external error, caused by the change of the environment distribution. We focus on the special class of reparameterizable RL problems, where the trajectory distribution can be decomposed using the reparameterization trick. For this problem class, estimating the expected reward is efficient and does not require costly trajectory re-sampling. This enables us to study reparameterizable RL using supervised learning and transfer learning theory. Our bound suggests that the generalization capability of reparameterizable RL is related to multiple factors, including the "smoothness" of the environment transitions, the reward, and the agent's policy function class. We also empirically verify the relationship between the generalization gap and these factors through simulations.
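As a rough illustration of the reparameterization trick applied to trajectories: once the environment and policy noise variables are sampled and held fixed, the return becomes a deterministic function of the policy parameters, so the expected reward can be re-evaluated for new parameters without re-sampling trajectories. The dynamics, policy form, and reward below are hypothetical:

```python
# A sketch of reparameterizable RL: fix the noise, and the trajectory (and
# hence the return) is a deterministic function of the policy parameters.
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(theta, f, reward, s0, xis, eps):
    """Return of one trajectory with pre-sampled noise (xis: dynamics noise,
    eps: policy noise). Re-evaluating under a new theta needs no re-sampling."""
    s, total = s0, 0.0
    for xi, ep in zip(xis, eps):
        a = theta @ s + ep          # reparameterized Gaussian policy: mean + noise
        total += reward(s, a)
        s = f(s, a) + xi            # reparameterized stochastic transition
    return total

# Estimate the expected return by averaging over fixed noise samples.
f = lambda s, a: 0.9 * s + 0.1 * a
reward = lambda s, a: -float(s @ s + 0.01 * a @ a)
theta, s0 = np.zeros((2, 2)), np.ones(2)
noise = [(rng.normal(size=(5, 2)), rng.normal(size=(5, 2))) for _ in range(64)]
J = np.mean([rollout_return(theta, f, reward, s0, xis, eps) for xis, eps in noise])
```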
Trajectory-Based Off-Policy Deep Reinforcement Learning
Andreas Doerr · Michael Volpp · Marc Toussaint · Sebastian Trimpe · Christian Daniel
Policy gradient methods are powerful reinforcement learning algorithms and have been demonstrated to solve many complex tasks. However, these methods are also data-inefficient, afflicted with high-variance gradient estimates, and frequently get stuck in local optima. This work addresses these weaknesses by combining recent improvements in the reuse of off-policy data and exploration in parameter space with deterministic behavioral policies. The resulting objective is amenable to standard neural network optimization strategies, like stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo. Incorporating previous rollouts via importance sampling greatly improves data efficiency, whilst stochastic optimization schemes facilitate the escape from local optima. We evaluate the proposed approach on a series of continuous control benchmark tasks. The results show that the proposed algorithm is able to successfully and reliably learn solutions using fewer system interactions than standard policy gradient methods.
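The off-policy reuse rests on trajectory-level importance sampling: each stored rollout is reweighted by the likelihood ratio of the current policy against the behavioral policy that generated it. A minimal sketch, with the data layout and the self-normalized estimator as assumptions:

```python
# A sketch of trajectory-level importance sampling for reusing off-policy
# rollouts; log-density callables and data layout are assumptions.
import numpy as np

def is_weighted_objective(trajs, log_prob_new, log_prob_old):
    """trajs: list of (states, actions, return) tuples. The trajectory ratio
    prod_t pi_new(a_t|s_t) / pi_old(a_t|s_t) is accumulated in log space."""
    total, weights = 0.0, []
    for states, actions, ret in trajs:
        logw = sum(log_prob_new(s, a) - log_prob_old(s, a)
                   for s, a in zip(states, actions))
        weights.append(np.exp(logw))
        total += weights[-1] * ret
    return total / np.sum(weights)   # self-normalized for numerical stability
```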
A Deep Reinforcement Learning Perspective on Internet Congestion Control
Nathan Jay · Noga H. Rotman · Brighten Godfrey · Michael Schapira · Aviv Tamar
We present and investigate a novel and timely application domain for deep reinforcement learning (RL): Internet congestion control. Congestion control is the core networking task of modulating traffic sources’ data-transmission rates so as to efficiently utilize network capacity. Congestion control is fundamental to computer networking research and practice, and has recently been the subject of extensive attention in light of the advent of Internet services such as live video, augmented and virtual reality, the Internet of Things, and more. We show that casting congestion control as an RL task enables the training of deep network policies that capture intricate patterns in data traffic and network conditions, and that these policies can be leveraged to outperform state-of-the-art congestion control schemes. Alongside these promising positive results, we also highlight significant challenges facing real-world adoption of RL-based congestion control solutions, such as fairness, safety, and generalization, which are not trivial to address within the conventional RL formalism. To facilitate further research into these challenges and reproducibility of our results, we present a test suite for RL-guided congestion control based on the OpenAI Gym interface.
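To make the formulation concrete, a congestion-control environment in the OpenAI Gym interface might expose recent network statistics as observations and rate adjustments as actions. The toy statistics, reward, and dynamics below are illustrative assumptions, not the paper's test suite:

```python
# A hypothetical Gym-style congestion-control environment (toy model).
import gym
import numpy as np
from gym import spaces

class ToyCongestionEnv(gym.Env):
    def __init__(self, capacity=100.0):
        self.capacity = capacity
        # e.g., (latency gradient, latency ratio, sending ratio)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(3,))
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,))  # rate change

    def reset(self):
        self.rate = 10.0
        return np.zeros(3, dtype=np.float32)

    def step(self, action):
        self.rate = max(1.0, self.rate * (1.0 + 0.1 * float(action[0])))
        throughput = min(self.rate, self.capacity)
        loss = max(0.0, self.rate - self.capacity) / self.rate
        latency = 1.0 + loss                          # toy latency model
        reward = throughput - 10.0 * loss - latency   # reward shaping assumed
        obs = np.array([0.0, latency, self.rate / self.capacity], np.float32)
        return obs, reward, False, {}
```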
Model-Based Active Exploration
Pranav Shyam · Wojciech Jaśkowski · Faustino Gomez
Efficient exploration is an unsolved problem in Reinforcement Learning, which is usually addressed by reactively rewarding the agent for fortuitously encountering novel situations. This paper introduces an efficient active exploration algorithm, Model-Based Active eXploration (MAX), which uses an ensemble of forward models to plan to observe novel events; novelty is assessed by measuring the potential disagreement between ensemble members using a principled criterion derived from the Bayesian perspective. We show empirically that in semi-random discrete environments, where directed exploration is critical to making progress, MAX is at least an order of magnitude more efficient than strong baselines. MAX also scales to high-dimensional continuous environments, where it builds task-agnostic models that can be used for any downstream task.
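The novelty signal can be pictured as the spread of an ensemble's predictions on the same transition; the agent then plans to reach transitions that maximize this utility rather than being rewarded for novelty after the fact. The variance-of-means score below is a simplification of the paper's information-theoretic criterion:

```python
# A sketch of disagreement-based novelty: query an ensemble of learned
# forward models on the same (state, action) pair and score how much their
# predictions spread. Model callables and shapes are assumptions.
import numpy as np

def ensemble_disagreement(models, state, action):
    """models: callables returning the predicted next-state mean."""
    preds = np.stack([m(state, action) for m in models])  # (n_models, state_dim)
    return preds.var(axis=0).sum()  # high where the ensemble disagrees

# A planner would maximize this utility over imagined action sequences,
# which is what makes the exploration active rather than reactive.
```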
Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
Daniel Brown · Wonjoon Goo · Prabhat Nagarajan · Scott Niekum
A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator. This is a consequence of the general reliance of IRL algorithms upon some form of mimicry, such as feature-count matching, rather than inferring the underlying intentions of the demonstrator, which may have been poorly executed in practice. In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations. When combined with deep reinforcement learning, we show that this approach can achieve performance more than an order of magnitude better than both the best-performing demonstration and a state-of-the-art behavioral cloning from observation method, on multiple Atari and MuJoCo benchmark tasks. Finally, we demonstrate that T-REX is robust to modest amounts of ranking noise, opening up future possibilities for automating the ranking process, for example, by watching a learner noisily improve at a task over time.
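The extrapolation mechanism is a pairwise ranking loss: a reward network is trained so that the predicted return of the higher-ranked trajectory beats that of the lower-ranked one under a Bradley-Terry style cross-entropy model. A minimal PyTorch sketch, with an assumed observation dimension and network:

```python
# A sketch of a trajectory-ranking reward loss in the spirit of T-REX.
# The architecture, observation dimension (8), and training loop are assumed.
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

def trex_loss(worse_traj, better_traj):
    """Each trajectory is a (T, obs_dim) tensor of observations."""
    r_worse = reward_net(worse_traj).sum()    # predicted return of worse traj
    r_better = reward_net(better_traj).sum()  # predicted return of better traj
    logits = torch.stack([r_worse, r_better])
    # Cross-entropy: the higher-ranked trajectory (index 1) should win.
    return nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
```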
Distributional Multivariate Policy Evaluation and Exploration with the Bellman GAN
Dror Freirich · Tzahi Shimkin · Ron Meir · Aviv Tamar
The recently proposed distributional approach to reinforcement learning (DiRL) is centered on learning the distribution of the reward-to-go, often referred to as the value distribution. In this work, we show that the distributional Bellman equation, which drives DiRL methods, is equivalent to a generative adversarial network (GAN) model. In this formulation, DiRL can be seen as learning a deep generative model of the value distribution, driven by the discrepancy between the distribution of the current value and the distribution of the sum of the current reward and next value. We use this insight to propose a GAN-based approach to DiRL, which leverages the strengths of GANs in learning distributions of high-dimensional data. In particular, we show that our GAN approach can be used for DiRL with multivariate rewards, an important setting that cannot be tackled with prior methods. The multivariate setting also allows us to unify learning the distribution of values and state transitions, and we exploit this idea to devise a novel exploration method that is driven by the discrepancy in estimating both values and states.
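Concretely, the GAN view pits a generator that samples from Z(s, a) against targets drawn from the distributional Bellman backup r + γZ(s', a'). The sketch below only shows how the two sample batches could be formed; networks, the GAN variant, and batch contents are assumptions:

```python
# A sketch of forming "fake" and "real" sample batches for a Bellman-GAN-style
# distributional update; the generator G and replay-buffer layout are assumed.
import torch

def gan_targets(G, batch, gamma=0.99, noise_dim=8):
    s, a, r, s2, a2 = batch                  # tensors from a replay buffer
    n1 = torch.randn(s.shape[0], noise_dim)
    n2 = torch.randn(s.shape[0], noise_dim)
    fake = G(s, a, n1)                       # samples of Z(s, a)
    with torch.no_grad():                    # target (Bellman backup) samples
        real = r + gamma * G(s2, a2, n2)     # r + gamma * Z(s', a'); r may be
    return fake, real                        # multivariate in this setting
```

A discriminator trained to tell the two batches apart then supplies the learning signal that drives the value distribution toward Bellman consistency.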
A Baseline for Any Order Gradient Estimation in Stochastic Computation Graphs
Jingkai Mao · Jakob Foerster · Tim Rocktäschel · Maruan Al-Shedivat · Gregory Farquhar · Shimon Whiteson
By enabling correct differentiation in Stochastic Computation Graphs (SCGs), the infinitely differentiable Monte-Carlo estimator (DiCE) can generate correct estimates for the higher-order gradients that arise in, e.g., multi-agent reinforcement learning and meta-learning. However, the baseline term in DiCE that serves as a control variate for reducing variance applies only to first-order gradient estimation, limiting the utility of higher-order gradient estimates. To improve the sample efficiency of DiCE, we propose a new baseline term for higher-order gradient estimation. This term may be easily included in the objective, and produces unbiased variance-reduced estimators under (automatic) differentiation, without affecting the estimate of the objective itself or of the first-order gradient estimate. It reuses the same baseline function (e.g., the state-value function in reinforcement learning) already used for the first-order baseline. We provide theoretical analysis and numerical evaluations of this new baseline, which demonstrate that it can dramatically reduce the variance of DiCE's second-order gradient estimators, and also show empirically that it reduces the variance of third- and fourth-order gradients. This computational tool can be easily used to estimate higher-order gradients with unprecedented efficiency and simplicity wherever automatic differentiation is utilized, and it has the potential to unlock applications of higher-order gradients in reinforcement learning and meta-learning.
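For context, DiCE hinges on the "MagicBox" operator, which evaluates to 1 but reproduces the score-function term under every differentiation; baseline terms are likewise added so that they evaluate to a constant and only affect gradients. A PyTorch sketch (the paper's higher-order baseline construction is not reproduced here, only the first-order analogue it extends):

```python
# A sketch of the DiCE MagicBox operator and a first-order-style baseline.
import torch

def magic_box(log_probs):
    # Equals 1 in the forward pass, but d/dtheta magic_box = magic_box * d(log_probs),
    # which is what makes repeated differentiation yield correct estimators.
    return torch.exp(log_probs - log_probs.detach())

def baseline_term(log_probs, baselines):
    # Evaluates to 0 in expectation-neutral fashion (each summand is a constant
    # times (1 - 1) in the forward pass), so it changes only the gradients.
    return ((1.0 - magic_box(log_probs)) * baselines).sum()
```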
Remember and Forget for Experience Replay
Guido Novati · Petros Koumoutsakos
Experience replay (ER) is a fundamental component of off-policy deep reinforcement learning (RL). ER recalls experiences from past iterations to compute gradient estimates for the current policy, increasing data efficiency. However, the accuracy of such updates may deteriorate when the policy diverges from past behaviors, which can undermine the performance of ER. Many algorithms mitigate this issue by tuning hyper-parameters to slow down policy changes. An alternative is to actively manage the experiences in the replay memory. We introduce Remember and Forget Experience Replay (ReF-ER), a novel method that can enhance RL algorithms with parameterized policies. ReF-ER (1) skips gradients computed from experiences that are too unlikely under the current policy and (2) regulates policy changes within a trust region of the replayed behaviors. We couple ReF-ER with Q-learning, deterministic policy gradient, and off-policy gradient methods. We find that ReF-ER consistently improves the performance of continuous-action, off-policy RL on fully observable benchmarks and partially observable flow control problems.
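Rule (1) can be pictured as a gate on the importance ratio between the current policy and the behavior that generated an experience. The threshold value and the trust-region penalty of rule (2) are simplified away in this sketch:

```python
# A sketch of ReF-ER's "forget" rule: skip gradients from experiences whose
# importance ratio says they are too off-policy. The cutoff c_max is an
# illustrative hyperparameter, not the paper's tuned value.
import numpy as np

def is_near_policy(logp_current, logp_behavior, c_max=4.0):
    rho = np.exp(logp_current - logp_behavior)  # pi(a|s) / mu(a|s)
    return 1.0 / c_max < rho < c_max            # False: zero out this gradient
```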