

Session

Reinforcement Learning


Thu 13 June 9:00 - 9:20 PDT

Batch Policy Learning under Constraints

Hoang Le · Cameron Voloshin · Yisong Yue

When learning policies for real-world domains, two important questions arise: (i) how to efficiently use existing off-line, off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We study the problem of batch policy learning under multiple constraints and offer a systematic solution. We first propose a flexible meta algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to lane keeping and smooth driving constraints. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.
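
A minimal sketch of the game-theoretic template this abstract describes, assuming hypothetical helpers `fitted_q_iteration` (standing in for any batch RL subroutine) and `off_policy_evaluate` (standing in for any OPE estimator); it illustrates a Lagrangian meta-loop, not the authors' exact algorithm or guarantees:

```python
# Hedged sketch: a Lagrangian meta-loop for batch policy learning under constraints.
# `fitted_q_iteration` and `off_policy_evaluate` are hypothetical stand-ins.
import numpy as np

def constrained_batch_policy_learning(dataset, n_constraints, thresholds,
                                      rounds=50, lr=0.1, lam_bound=10.0):
    lam = np.ones(n_constraints)          # Lagrange multipliers (online player)
    mixture = []                          # policies returned by the best-response player

    for _ in range(rounds):
        # Best response: batch RL on the Lagrangian-weighted scalar cost
        # c(s, a) = main_cost(s, a) + sum_k lam[k] * constraint_cost_k(s, a).
        policy = fitted_q_iteration(dataset, weights=np.concatenate(([1.0], lam)))
        mixture.append(policy)

        # Online (exponentiated-gradient) update of the multipliers using
        # off-policy estimates of each constraint value on the batch data.
        g_hat = np.array([off_policy_evaluate(dataset, policy, k)
                          for k in range(n_constraints)])
        lam *= np.exp(lr * (g_hat - thresholds))   # raise lam where constraints appear violated
        lam = np.minimum(lam, lam_bound)           # keep multipliers bounded

    # Return the uniform mixture over the iterates.
    return mixture
```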

Thu 13 June 9:20 - 9:25 PDT

Quantifying Generalization in Reinforcement Learning

Karl Cobbe · Oleg Klimov · Chris Hesse · Taehoon Kim · John Schulman

In this paper, we investigate the problem of overfitting in deep reinforcement learning. In the most common RL benchmarks, it is customary to use the same environments for both training and testing. This practice offers relatively little insight into an agent's ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test sets. Most notably, we introduce a new environment called CoinRun, designed as a benchmark for generalization in RL. Using CoinRun, we find that agents overfit to surprisingly large training sets. We then show that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation, and batch normalization.
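
As a concrete illustration of the evaluation protocol, here is a minimal sketch of splitting procedurally generated levels into disjoint training and test seed pools; `make_procgen_env` is a hypothetical constructor standing in for an environment such as CoinRun:

```python
# Hedged sketch: disjoint level seeds for training vs. evaluation.
# `make_procgen_env` is a hypothetical environment constructor.
import numpy as np

def make_level_splits(num_train_levels=500, num_test_levels=100, rng_seed=0):
    rng = np.random.default_rng(rng_seed)
    seeds = rng.permutation(10_000)
    return (seeds[:num_train_levels],
            seeds[num_train_levels:num_train_levels + num_test_levels])

train_seeds, test_seeds = make_level_splits()

def sample_env(seed_pool):
    # Each episode draws a level from the given pool, so the agent
    # never sees test levels during training.
    return make_procgen_env(level_seed=int(np.random.choice(seed_pool)))
```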

Thu 13 June 9:25 - 9:30 PDT

Learning Latent Dynamics for Planning from Pixels

Danijar Hafner · Timothy Lillicrap · Ian Fischer · Ruben Villegas · David Ha · Honglak Lee · James Davidson

Planning has been very successful for control tasks with known environment dynamics. To leverage planning in unknown environments, the agent needs to learn the dynamics from interactions with the world. However, learning dynamics models that are accurate enough for planning has been a long-standing challenge, especially in image-based domains. We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space. To achieve high performance, the dynamics model must accurately predict the rewards ahead for multiple time steps. We approach this problem using a latent dynamics model with both deterministic and stochastic transition components and a multi-step variational inference objective that we call latent overshooting. Using only pixel observations, our agent solves continuous control tasks with contact dynamics, partial observability, and sparse rewards, which exceed the difficulty of tasks previously solved by planning with learned models. PlaNet uses substantially fewer episodes and reaches final performance close to, and sometimes higher than, that of strong model-free algorithms.
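
A minimal sketch of what "fast online planning in latent space" can look like, using the cross-entropy method over learned latent dynamics; `transition` and `reward` are hypothetical stand-ins for the learned model's heads, and the deterministic/stochastic split and the latent overshooting objective are not reproduced here:

```python
# Hedged sketch: cross-entropy-method planning over a learned latent model.
# `transition` and `reward` are hypothetical callables from the learned model.
import numpy as np

def plan_in_latent_space(state, transition, reward, action_dim,
                         horizon=12, candidates=1000, elites=100, iters=10):
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences and roll them out in latent space.
        plans = mean + std * np.random.randn(candidates, horizon, action_dim)
        returns = np.zeros(candidates)
        for i, plan in enumerate(plans):
            s = state
            for a in plan:
                s = transition(s, a)          # predicted next latent state
                returns[i] += reward(s)       # predicted reward, no decoding to pixels
        # Refit the sampling distribution to the best-performing sequences.
        elite = plans[np.argsort(returns)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]                            # execute only the first action (MPC-style)
```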

Thu 13 June 9:30 - 9:35 PDT

Projections for Approximate Policy Iteration Algorithms

Riad Akrour · Joni Pajarinen · Jan Peters · Gerhard Neumann

Approximate policy iteration is a class of reinforcement learning algorithms in which both the value function and the policy are encoded using function approximators; it has been especially prominent in continuous action spaces. However, encoding the policy with a function approximator often makes it necessary to constrain the change in action distribution during the policy update to ensure an increase in the policy return. Several approximations exist in the literature to solve this constrained policy update problem. In this paper, we propose to improve on such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one, which is then solved by standard gradient descent. Using these projections, we empirically demonstrate that our approach can improve both the policy update solution and the control over exploration of existing approximate policy iteration algorithms.
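
As a toy illustration of the projection idea (not the paper's specific entropy and KL projections), the sketch below projects the standard deviations of a diagonal Gaussian policy back above an entropy bound after an unconstrained gradient step:

```python
# Hedged sketch: enforcing an entropy constraint on a diagonal Gaussian policy
# by projection, so the update itself can remain unconstrained.
import numpy as np

def project_entropy(log_std, min_entropy):
    """Shift the log standard deviations of a diagonal Gaussian so its
    differential entropy is at least `min_entropy`."""
    d = log_std.size
    entropy = 0.5 * d * np.log(2.0 * np.pi * np.e) + log_std.sum()
    if entropy >= min_entropy:
        return log_std                         # constraint already satisfied
    # Adding the same constant to every log-std raises the entropy linearly,
    # so the projection is a uniform shift.
    return log_std + (min_entropy - entropy) / d

# Usage: take any unconstrained gradient step on `log_std`, then project.
log_std = np.log(np.array([0.05, 0.02, 0.1]))
log_std = project_entropy(log_std, min_entropy=-2.0)
```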

Thu 13 June 9:35 - 9:40 PDT

Learning Structured Decision Problems with Unawareness

Craig Innes · Alex Lascarides

Structured models of decision making often assume an agent is aware of all possible states and actions in advance. This assumption is sometimes untenable. In this paper, we learn Bayesian Decision Networks from both domain exploration and expert assertions in a way which guarantees convergence to optimal behaviour, even when the agent starts unaware of actions or belief variables that are critical to success. Our experiments show that our agent learns optimal behaviour on both small and large decision problems, and that allowing an agent to conserve information upon making new discoveries results in faster convergence.

Thu 13 June 9:40 - 10:00 PDT

Calibrated Model-Based Deep Reinforcement Learning

Ali Malik · Volodymyr Kuleshov · Jiaming Song · Danny Nemer · Harlan Seymour · Stefano Ermon

Accurate estimates of predictive uncertainty are important for building effective model-based reinforcement learning agents. However, predictive uncertainties, especially ones derived from modern neural networks, are often inaccurate and impose a bottleneck on performance. Here, we argue that ideal model uncertainties should be calibrated, i.e., their probabilities should match the empirical frequencies of predicted events. We describe a simple way to augment any model-based reinforcement learning algorithm with calibrated uncertainties and show that doing so consistently improves the accuracy of planning and helps agents balance exploration and exploitation. On the HalfCheetah MuJoCo task, our system achieves state-of-the-art performance using 50% fewer samples than the current leading approach. Our findings suggest that calibration can improve the performance and sample complexity of model-based reinforcement learning with minimal computational and implementation overhead.
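
One simple way to obtain calibrated uncertainties, in the spirit of this abstract, is to recalibrate the model's predicted CDF with isotonic regression on held-out data; the sketch below is a generic recalibration step and may differ from the authors' exact procedure:

```python
# Hedged sketch: recalibrating a probabilistic model's predicted CDF with
# isotonic regression. `predicted_cdf_values` are the model's CDF evaluated
# at the observed outcomes on a held-out set.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_recalibrator(predicted_cdf_values):
    """Map predicted probability levels to the empirical frequency with which
    outcomes actually fall below the corresponding quantile."""
    p = np.sort(np.asarray(predicted_cdf_values))
    empirical = np.arange(1, len(p) + 1) / len(p)
    recalibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    recalibrator.fit(p, empirical)
    return recalibrator

# Usage: wrap every CDF query of the dynamics model during planning, e.g.
# calibrated_prob = recalibrator.predict([model_cdf_value])[0]
```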

Thu 13 June 10:00 - 10:05 PDT

Reinforcement Learning in Configurable Continuous Environments

Alberto Maria Metelli · Emanuele Ghelfi · Marcello Restelli

Configurable Markov Decision Processes (Conf-MDPs) have recently been introduced as an extension of the usual MDP model to account for the possibility of configuring the environment to improve the agent's performance. Currently, there is still no suitable algorithm to solve the learning problem for real-world Conf-MDPs. In this paper, we fill this gap by proposing a trust-region method, Relative Entropy Model Policy Search (REMPS), able to learn both the policy and the MDP configuration in continuous domains without requiring knowledge of the true model of the environment. After introducing our approach and providing a finite-sample analysis, we empirically evaluate REMPS on both benchmark and realistic environments, comparing our results with those of gradient-based methods.
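
For intuition, the sketch below shows a REPS-style relative-entropy step of the kind REMPS builds on: sampled transitions are reweighted by exponentiated advantages, with the temperature chosen from the dual so the implied distribution stays within a KL bound. The subsequent projection onto the policy and environment-configuration parameters is not shown, and the details are illustrative rather than the paper's algorithm:

```python
# Hedged sketch: REPS-style exponentiated-advantage weighting under a KL bound.
import numpy as np
from scipy.optimize import minimize_scalar

def reps_weights(advantages, kl_bound=0.1):
    adv = advantages - advantages.max()          # shift for numerical stability

    def dual(eta):
        # Standard REPS dual: g(eta) = eta * epsilon + eta * log E[exp(A / eta)].
        return eta * kl_bound + eta * np.log(np.mean(np.exp(adv / eta)))

    eta = minimize_scalar(dual, bounds=(1e-4, 1e4), method="bounded").x
    w = np.exp(adv / eta)
    return w / w.sum()

# Usage: the weights would then drive weighted maximum-likelihood fits of both
# the policy parameters and the configurable transition-model parameters.
```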

Thu 13 June 10:05 - 10:10 PDT

Target-Based Temporal-Difference Learning

Donghwan Lee · Niao He

The use of target networks has been a popular and key component of recent deep Q-learning algorithms for reinforcement learning, yet little is known about them from a theoretical standpoint. In this work, we introduce a new family of target-based temporal-difference (TD) learning algorithms and provide a theoretical analysis of their convergence. In contrast to standard TD learning, target-based TD algorithms maintain two separate learning parameters: the target variable and the online variable. In particular, we introduce three members of the family, called averaging TD, double TD, and periodic TD, where the target variable is updated in an averaging, symmetric, or periodic fashion, respectively, mirroring the target-network updates used in recent deep Q-learning.

We establish an asymptotic convergence analysis for both the averaging TD and double TD algorithms and a finite-sample analysis for the periodic TD algorithm. We also provide simulation results showing potentially superior convergence of the target-based TD algorithms compared to standard TD learning. While this work focuses on linear function approximation and the policy evaluation setting, we consider it a meaningful step towards a theoretical understanding of deep Q-learning variants with target networks.
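
As a concrete illustration of one family member, here is a minimal sketch of "periodic" target-based TD(0) with linear function approximation for policy evaluation, where the target weights are a hard copy of the online weights taken every fixed number of steps; the transition sampling is left abstract:

```python
# Hedged sketch: periodic target-based TD(0) with linear function approximation.
# `transitions` is any iterable of (phi_s, r, phi_s_next) feature/reward tuples.
import numpy as np

def periodic_td(transitions, feature_dim, alpha=0.05, gamma=0.99, period=100):
    theta = np.zeros(feature_dim)        # online variable
    theta_target = theta.copy()          # target variable, updated only periodically

    for t, (phi, r, phi_next) in enumerate(transitions):
        # The TD error is computed against the frozen target parameters,
        # as with target networks in deep Q-learning.
        td_error = r + gamma * phi_next @ theta_target - phi @ theta
        theta += alpha * td_error * phi
        if (t + 1) % period == 0:
            theta_target = theta.copy()  # periodic hard copy of the online weights
    return theta
```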

Thu 13 June 10:10 - 10:15 PDT

Iterative Linearized Control: Stable Algorithms and Complexity Guarantees

Vincent Roulet · Dmitriy Drusvyatskiy · Siddhartha Srinivasa · Zaid Harchaoui

We frame several popular iterative linearized control algorithms for nonlinear control such as ILQR and ILQG as nonlinear optimization algorithms on an objective whose first-order information can be computed using dynamic programming. Our framework allows us to identify a gradient back-propagation oracle corresponding to dynamic programming. The number of calls to this oracle is arguably the relevant measure of complexity of such algorithms in modern computing environments. We also highlight several missing components in these algorithms and propose stable and accelerated variants that enjoy worst-case complexity guarantees from a nonlinear optimization viewpoint.
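
A minimal sketch of the "gradient back-propagation oracle" viewpoint: for a finite-horizon objective, the gradient with respect to the control sequence can be computed with one forward rollout and one backward adjoint pass through the dynamics Jacobians. The Jacobian callables below are hypothetical user-supplied functions, and no terminal cost is included:

```python
# Hedged sketch: gradient of sum_t c(x_t, u_t) w.r.t. the controls, for
# dynamics x_{t+1} = f(x_t, u_t), via a dynamic-programming backward pass.
# f, f_x, f_u, c_x, c_u are hypothetical user-supplied callables.
import numpy as np

def control_gradient(x0, controls, f, f_x, f_u, c_x, c_u):
    T = len(controls)

    # Forward pass: roll out the trajectory.
    xs = [x0]
    for t in range(T):
        xs.append(f(xs[t], controls[t]))

    # Backward pass: propagate the adjoint (costate) p through the Jacobians,
    # the same recursion that underlies ILQR-type methods.
    grads = [None] * T
    p = np.zeros_like(x0)                      # no terminal cost in this sketch
    for t in reversed(range(T)):
        grads[t] = c_u(xs[t], controls[t]) + f_u(xs[t], controls[t]).T @ p
        p = c_x(xs[t], controls[t]) + f_x(xs[t], controls[t]).T @ p
    return grads
```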

Thu 13 June 10:15 - 10:20 PDT

Finding Options that Minimize Planning Time

Yuu Jinnai · David Abel · David Hershkowitz · Michael L. Littman · George Konidaris

We formalize the problem of selecting the optimal set of options for planning as that of computing the smallest set of options so that planning converges in less than a given maximum of value-iteration passes. We first show that the problem is NP-hard, even if the task is constrained to be deterministic---the first such complexity result for option discovery. We then present the first polynomial-time boundedly suboptimal approximation algorithm for this setting, and empirically evaluate it against both the optimal options and a representative collection of heuristic approaches in simple grid-based domains including the classic four-rooms problem.
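
For concreteness, the sketch below computes the quantity the paper's objective counts, namely the number of value-iteration passes to convergence when planning with a given set of options, each summarized by its discounted reward and transition models; it measures planning time for a fixed option set and is not the authors' option-selection algorithm:

```python
# Hedged sketch: counting value-iteration passes when planning with options.
# option_models: list of (R_o, P_o) pairs, where R_o is the expected discounted
# reward vector and P_o the discounted multi-step transition matrix of option o.
import numpy as np

def vi_passes_with_options(option_models, epsilon=1e-4, max_passes=10_000):
    n_states = option_models[0][0].shape[0]
    V = np.zeros(n_states)
    for passes in range(1, max_passes + 1):
        # Bellman backup over options: each option acts as a temporally extended action.
        Q = np.stack([R + P @ V for R, P in option_models])   # shape: (n_options, n_states)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < epsilon:
            return passes
        V = V_new
    return max_passes
```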