Session
Reinforcement Learning
Batch Policy Learning under Constraints
Hoang Le · Cameron Voloshin · Yisong Yue
When learning policies for real-world domains, two important questions arise: (i) how to efficiently use existing offline, off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We study the problem of batch policy learning under multiple constraints and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including a challenging problem of simulated car driving subject to lane-keeping and smooth-driving constraints. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.
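The meta-algorithm described above reduces constrained policy learning to a repeated interaction between a batch-RL "best response" over policies and an online learner over constraint weights. Below is a minimal Python sketch of that Lagrangian-style loop, assuming caller-supplied `best_response` (any batch RL solver run on the lambda-weighted objective) and `ope_estimate` (any off-policy evaluation routine); both names and the simple projected-gradient update are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def constrained_batch_policy_learning(dataset, constraint_limits, best_response,
                                      ope_estimate, n_rounds=50, lr=0.1,
                                      lambda_max=10.0):
    """Alternate a batch-RL best response with an online (projected gradient)
    update of the Lagrange multipliers; return the mixture of policy iterates."""
    n_constraints = len(constraint_limits)
    lam = np.zeros(n_constraints)   # one Lagrange multiplier per constraint
    mixture = []                    # mixture policy = uniform over the iterates

    for _ in range(n_rounds):
        # Batch RL subroutine: optimize the main cost plus lambda-weighted constraint costs.
        policy = best_response(dataset, lam)
        mixture.append(policy)

        # Off-policy evaluation certifies each constraint value for the new policy.
        values = np.array([ope_estimate(dataset, policy, k)
                           for k in range(n_constraints)])

        # Online learner: raise lambda on violated constraints, then project to [0, lambda_max].
        lam = np.clip(lam + lr * (values - np.asarray(constraint_limits)),
                      0.0, lambda_max)

    return mixture, lam
```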
Quantifying Generalization in Reinforcement Learning
Karl Cobbe · Oleg Klimov · Chris Hesse · Taehoon Kim · John Schulman
In this paper, we investigate the problem of overfitting in deep reinforcement learning. On the most common benchmarks in RL, it is customary to use the same environments for both training and testing, a practice that offers relatively little insight into an agent's ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test sets. Most notably, we introduce a new environment called CoinRun, designed as a benchmark for generalization in RL. Using CoinRun, we find that agents overfit to surprisingly large training sets. We then show that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation, and batch normalization.
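Since the benchmark's key idea is simply to hold out procedurally generated levels, the train/test protocol can be summarized in a few lines. The sketch below assumes a classic Gym-style environment constructed from an integer level seed and a caller-supplied training routine; `make_env` and `train_policy_fn` are placeholders, not the paper's API.

```python
import numpy as np

def generalization_gap(train_policy_fn, make_env, n_train_levels=500,
                       n_test_levels=100):
    """Train on a fixed set of level seeds, then compare mean return on the
    training levels against unseen levels; the gap measures overfitting."""
    train_seeds = list(range(n_train_levels))
    test_seeds = list(range(10_000, 10_000 + n_test_levels))  # disjoint from training

    policy = train_policy_fn([make_env(s) for s in train_seeds])

    def mean_return(seeds):
        returns = []
        for s in seeds:
            env = make_env(s)
            obs, done, total = env.reset(), False, 0.0   # classic Gym API assumed
            while not done:
                obs, reward, done, _ = env.step(policy(obs))
                total += reward
            returns.append(total)
        return float(np.mean(returns))

    train_ret, test_ret = mean_return(train_seeds), mean_return(test_seeds)
    return train_ret, test_ret, train_ret - test_ret
```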
Learning Latent Dynamics for Planning from Pixels
Danijar Hafner · Timothy Lillicrap · Ian Fischer · Ruben Villegas · David Ha · Honglak Lee · James Davidson
Planning has been very successful for control tasks with known environment dynamics. To leverage planning in unknown environments, the agent needs to learn the dynamics from interactions with the world. However, learning dynamics models that are accurate enough for planning has been a long-standing challenge, especially in image-based domains. We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space. To achieve high performance, the dynamics model must accurately predict the rewards ahead for multiple time steps. We approach this problem using a latent dynamics model with both deterministic and stochastic transition components and a multi-step variational inference objective that we call latent overshooting. Using only pixel observations, our agent solves continuous control tasks with contact dynamics, partial observability, and sparse rewards, which exceed the difficulty of tasks that were previously solved by planning with learned models. PlaNet uses substantially fewer episodes and reaches final performance close to, and sometimes higher than, that of strong model-free algorithms.
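The planning component is agnostic to the learned model's internals: given latent transition and reward predictors, actions can be selected by a cross-entropy-method search over action sequences carried out entirely in latent space, as the sketch below illustrates. The `transition` and `reward` callables stand in for the learned dynamics and reward heads and are assumptions for illustration, not the paper's exact interfaces.

```python
import numpy as np

def cem_plan(state, transition, reward, action_dim, horizon=12,
             iterations=10, candidates=1000, top_k=100):
    """Cross-entropy search over action sequences, evaluated in latent space."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences and roll them out with the learned model.
        actions = mean + std * np.random.randn(candidates, horizon, action_dim)
        returns = np.zeros(candidates)
        for i in range(candidates):
            s = state
            for t in range(horizon):
                s = transition(s, actions[i, t])   # learned latent dynamics
                returns[i] += reward(s)            # learned reward predictor
        # Refit the sampling distribution to the best-scoring sequences.
        elite = actions[np.argsort(returns)[-top_k:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]   # execute the first action only, then replan at the next step
```

Replanning at every step keeps the agent robust to model errors that would otherwise accumulate over the horizon.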
Projections for Approximate Policy Iteration Algorithms
Riad Akrour · Joni Pajarinen · Jan Peters · Gerhard Neumann
Approximate policy iteration is a class of reinforcement learning algorithms in which both the value function and the policy are encoded using function approximators; it has been especially prominent in continuous action spaces. However, by encoding the policy with a function approximator, it often becomes necessary to constrain the change in action distribution during the policy update to ensure an increase in the policy return. Several approximations exist in the literature for solving this constrained policy update problem. In this paper, we propose to improve over such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one, which is then solved by standard gradient descent. Using these projections, we empirically demonstrate that our approach can improve both the policy update solution and the control over exploration of existing approximate policy iteration algorithms.
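As a concrete instance of the projection idea, consider an entropy constraint on a diagonal Gaussian policy: if an unconstrained update drops the entropy below a target, the parameters can be projected back onto the feasible set in closed form, after which standard gradient descent applies. The sketch below is illustrative and covers only this one constraint, not the paper's full set of projections.

```python
import numpy as np

def project_entropy(mean, log_std, min_entropy):
    """Project a diagonal Gaussian onto the set {entropy >= min_entropy}
    by uniformly rescaling the standard deviations."""
    d = log_std.size
    entropy = 0.5 * d * np.log(2.0 * np.pi * np.e) + log_std.sum()
    if entropy >= min_entropy:
        return mean, log_std            # constraint already satisfied
    # Scaling every std by a factor c adds d*log(c) nats of entropy,
    # so shift each log-std just enough to reach the bound exactly.
    shift = (min_entropy - entropy) / d
    return mean, log_std + shift
```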
Learning Structured Decision Problems with Unawareness
Craig Innes · Alex Lascarides
Structured models of decision making often assume an agent is aware of all possible states and actions in advance. This assumption is sometimes untenable. In this paper, we learn Bayesian Decision Networks from both domain exploration and expert assertions in a way which guarantees convergence to optimal behaviour, even when the agent starts unaware of actions or belief variables that are critical to success. Our experiments show that our agent learns optimal behaviour on both small and large decision problems, and that allowing an agent to conserve information upon making new discoveries results in faster convergence.
Calibrated Model-Based Deep Reinforcement Learning
Ali Malik · Volodymyr Kuleshov · Jiaming Song · Danny Nemer · Harlan Seymour · Stefano Ermon
Accurate estimates of predictive uncertainty are important for building effective model-based reinforcement learning agents. However, predictive uncertainties, especially ones derived from modern neural networks, are often inaccurate and impose a bottleneck on performance. Here, we argue that ideal model uncertainties should be calibrated, i.e., their probabilities should match empirical frequencies of predicted events. We describe a simple way to augment any model-based reinforcement learning algorithm with calibrated uncertainties and show that doing so consistently improves the accuracy of planning and helps agents balance exploration and exploitation. On the HalfCheetah MuJoCo task, our system achieves state-of-the-art performance using 50% fewer samples than the current leading approach. Our findings suggest that calibration can improve the performance and sample complexity of model-based reinforcement learning with minimal computational and implementation overhead.
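A calibration check and fix can sit on top of any probabilistic dynamics model: evaluate the model's predicted CDF at held-out outcomes and fit a monotone map from predicted to empirical frequencies. The sketch below assumes a Gaussian predictive model purely for illustration; the recalibration-by-isotonic-regression recipe is a standard one and is not claimed to be the paper's exact implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.isotonic import IsotonicRegression

def fit_recalibrator(pred_mean, pred_std, observed):
    """Fit a monotone map from the model's predicted CDF values to the
    empirical frequencies observed on held-out transitions."""
    p = norm.cdf(observed, loc=pred_mean, scale=pred_std)   # predicted CDF at each outcome
    p_sorted = np.sort(p)
    empirical = np.arange(1, len(p) + 1) / len(p)           # empirical CDF of those values
    recalibrator = IsotonicRegression(out_of_bounds="clip")
    recalibrator.fit(p_sorted, empirical)
    return recalibrator

# Usage sketch: probabilities handed to the planner become recalibrator.predict(p),
# so events predicted with probability q should occur roughly a q fraction of the time.
```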
Reinforcement Learning in Configurable Continuous Environments
Alberto Maria Metelli · Emanuele Ghelfi · Marcello Restelli
Configurable Markov Decision Processes (Conf-MDPs) have recently been introduced as an extension of the usual MDP model to account for the possibility of configuring the environment to improve the agent's performance. Currently, there is still no suitable algorithm for solving the learning problem in real-world Conf-MDPs. In this paper, we fill this gap by proposing a trust-region method, Relative Entropy Model Policy Search (REMPS), able to learn both the policy and the MDP configuration in continuous domains without requiring knowledge of the true model of the environment. After introducing our approach and providing a finite-sample analysis, we empirically evaluate REMPS on both benchmark and realistic environments, comparing our results with those of gradient-based methods.
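Trust-region methods in this family typically alternate a reweighting step, which moves the sampling distribution toward higher return subject to a KL bound, with a projection of the reweighted distribution back onto the parametric policy and configuration spaces. The sketch below shows only the generic REPS-style reweighting step under those assumptions; it is not claimed to reproduce the exact REMPS derivation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def reps_weights(returns, kl_bound):
    """Exponential weights w_i proportional to exp(R_i / eta), with the
    temperature eta chosen by minimizing the standard REPS dual so that
    the reweighted distribution stays within kl_bound of the sampling one."""
    R = np.asarray(returns, dtype=float)

    def dual(log_eta):
        eta = np.exp(log_eta)
        z = R / eta
        # g(eta) = eta*kl_bound + eta*log mean_i exp(R_i / eta), computed stably.
        return eta * kl_bound + eta * (np.log(np.mean(np.exp(z - z.max()))) + z.max())

    # Search over log(eta) on a generous bounded interval (an arbitrary choice here).
    log_eta = minimize_scalar(dual, bounds=(-5.0, 5.0), method="bounded").x
    eta = np.exp(log_eta)
    w = np.exp((R - R.max()) / eta)
    return w / w.sum(), eta
```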
Target-Based Temporal-Difference Learning
Donghwan Lee · Niao He
The use of target networks has been a popular and key component of recent deep Q-learning algorithms for reinforcement learning, yet little is known about it from the theory side. In this work, we introduce a new family of target-based temporal difference (TD) learning algorithms and provide a theoretical analysis of their convergence. In contrast to standard TD-learning, target-based TD algorithms maintain two separate learning parameters: the target variable and the online variable. In particular, we introduce three members of the family, called averaging TD, double TD, and periodic TD, where the target variable is updated in an averaging, symmetric, or periodic fashion, respectively, mirroring the target-network updates used in recent deep Q-learning.
We establish an asymptotic convergence analysis for the averaging TD and double TD algorithms and a finite-sample analysis for the periodic TD algorithm. In addition, we provide simulation results showing potentially superior convergence of the target-based TD algorithms compared to standard TD-learning. While this work focuses on linear function approximation and the policy evaluation setting, we consider it a meaningful step towards a theoretical understanding of deep Q-learning variants with target networks.
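To make the distinction between the online and target variables concrete, the sketch below implements one plausible reading of averaging TD for policy evaluation with linear function approximation: the TD error bootstraps from the target parameters, which in turn track the online parameters by Polyak averaging. The notation (a feature map `phi`, step sizes `alpha` and `tau`) is ours, and the sketch is illustrative rather than the paper's exact update rules.

```python
import numpy as np

def averaging_td(transitions, phi, dim, gamma=0.99, alpha=0.05, tau=0.01):
    """Evaluate a fixed policy from (state, reward, next_state) samples using
    two parameter vectors: an online variable and a slowly moving target."""
    theta = np.zeros(dim)          # online variable
    theta_target = np.zeros(dim)   # target variable
    for (s, r, s_next) in transitions:
        # Bootstrap from the target parameters, as deep Q-learning does with target networks.
        td_error = r + gamma * phi(s_next) @ theta_target - phi(s) @ theta
        theta = theta + alpha * td_error * phi(s)
        # The target variable drifts toward the online variable by averaging.
        theta_target = theta_target + tau * (theta - theta_target)
    return theta
```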
Iterative Linearized Control: Stable Algorithms and Complexity Guarantees
Vincent Roulet · Dmitriy Drusvyatskiy · Siddhartha Srinivasa · Zaid Harchaoui
We frame several popular iterative linearized control algorithms for nonlinear control, such as ILQR and ILQG, as nonlinear optimization algorithms on an objective whose first-order information can be computed using dynamic programming. Our framework allows us to identify a gradient back-propagation oracle corresponding to dynamic programming. The number of calls to this oracle is arguably the relevant measure of complexity of such algorithms in modern computing environments. We also highlight several missing components in these algorithms and propose stable and accelerated variants that enjoy worst-case complexity guarantees from a nonlinear optimization viewpoint.
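The gradient back-propagation oracle amounts to one backward (adjoint) sweep through the linearized dynamics, which returns the gradient of the total cost with respect to every control in a single pass. The sketch below spells this out; the Jacobian and cost-gradient callables are assumptions made for illustration, not an interface from the paper.

```python
import numpy as np

def control_gradient(xs, us, dyn_jac_x, dyn_jac_u, cost_grad_x, cost_grad_u,
                     final_grad_x):
    """xs: states x_0..x_T from a forward rollout; us: controls u_0..u_{T-1}.
    Returns d(total cost)/d(u_t) for every t via one backward recursion."""
    T = len(us)
    grads = [None] * T
    lam = final_grad_x(xs[T])                  # adjoint state at the final time step
    for t in reversed(range(T)):
        A = dyn_jac_x(xs[t], us[t])            # df/dx at (x_t, u_t)
        B = dyn_jac_u(xs[t], us[t])            # df/du at (x_t, u_t)
        grads[t] = cost_grad_u(xs[t], us[t]) + B.T @ lam
        lam = cost_grad_x(xs[t], us[t]) + A.T @ lam
    return grads
```

Counting calls to this routine, rather than wall-clock time per call, is the complexity measure the framework emphasizes.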
Finding Options that Minimize Planning Time
Yuu Jinnai · David Abel · David Hershkowitz · Michael L. Littman · George Konidaris
We formalize the problem of selecting the optimal set of options for planning as that of computing the smallest set of options such that planning converges within a given maximum number of value-iteration passes. We first show that the problem is NP-hard, even when the task is constrained to be deterministic, which is the first such complexity result for option discovery. We then present the first polynomial-time boundedly suboptimal approximation algorithm for this setting, and empirically evaluate it against both the optimal options and a representative collection of heuristic approaches in simple grid-based domains, including the classic four-rooms problem.
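The quantity being minimized over option sets can be made concrete with a small routine that counts how many value-iteration sweeps a given option set needs before convergence. The multi-time-step (SMDP-style) option models below, stored as discounted reward vectors and termination-transition matrices, are a representation chosen for illustration, not the paper's code.

```python
import numpy as np

def vi_passes(option_models, n_states, epsilon=1e-4, max_passes=10_000):
    """option_models: list of (R, P) pairs where R[s] is the expected discounted
    reward of running the option from state s and P[s, s'] is the discounted
    probability of terminating in s'. Primitive actions are one-step options.
    Returns the number of sweeps value iteration needs to converge."""
    V = np.zeros(n_states)
    for sweep in range(1, max_passes + 1):
        Q = np.stack([R + P @ V for (R, P) in option_models])  # one backup per option
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < epsilon:
            return sweep
        V = V_new
    return max_passes
```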