

Session

Reinforcement Learning 4


Wed 11 July 7:00 - 7:20 PDT

Programmatically Interpretable Reinforcement Learning

Abhinav Verma · Vijayaraghavan Murali · Rishabh Singh · Pushmeet Kohli · Swarat Chaudhuri

We present a reinforcement learning framework, called Programmatically Interpretable Reinforcement Learning (PIRL), that is designed to generate interpretable and verifiable agent policies. Unlike the popular Deep Reinforcement Learning (DRL) paradigm, which represents policies by neural networks, PIRL represents policies using a high-level, domain-specific programming language. Such programmatic policies have the benefits of being more easily interpreted than neural networks, and being amenable to verification by symbolic methods. We propose a new method, called Neurally Directed Program Search (NDPS), for solving the challenging nonsmooth optimization problem of finding a programmatic policy with maximal reward. NDPS works by first learning a neural policy network using DRL, and then performing a local search over programmatic policies that seeks to minimize a distance from this neural “oracle”. We evaluate NDPS on the task of learning to drive a simulated car in the TORCS car-racing environment. We demonstrate that NDPS is able to discover human-readable policies that pass some significant performance bars. We also show that PIRL policies can have smoother trajectories, and can be more easily transferred to environments not encountered during training, than corresponding policies discovered by DRL.
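
To make the imitate-then-search idea concrete, here is a minimal Python sketch of the NDPS loop described above: train a neural oracle with DRL, then locally search the program space for a policy whose actions stay close to the oracle's on visited states. The callables `oracle`, `rollout`, and `neighbors`, and the assumption of scalar actions, are illustrative stand-ins, not the authors' implementation.

```python
# Illustrative sketch of the NDPS loop (not the authors' code).
# Assumes: oracle(state) is a trained DRL policy, rollout(policy) returns a list
# of states visited under that policy, and neighbors(program) enumerates small
# local edits to a programmatic policy in the DSL. All names are hypothetical.

def ndps(oracle, initial_program, rollout, neighbors, n_iters=10):
    def distance(program, states):
        # Imitation distance: how far the program's actions are from the oracle's
        # (scalar actions assumed, e.g. a steering command).
        return sum(abs(program(s) - oracle(s)) for s in states) / len(states)

    program = initial_program
    states = rollout(oracle)                      # states visited by the neural oracle
    for _ in range(n_iters):
        best, best_d = program, distance(program, states)
        for candidate in neighbors(program):      # local search over programmatic policies
            d = distance(candidate, states)
            if d < best_d:
                best, best_d = candidate, d
        program = best
        states = states + rollout(program)        # also cover states the current program reaches
    return program
```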

Wed 11 July 7:20 - 7:40 PDT

Learning by Playing - Solving Sparse Reward Tasks from Scratch

Martin Riedmiller · Roland Hafner · Thomas Lampe · Michael Neunert · Jonas Degrave · Tom Van de Wiele · Vlad Mnih · Nicolas Heess · Jost Springenberg

We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors - from scratch - in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment - enabling it to excel at sparse reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach.
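
As a rough illustration of the scheduling-plus-shared-experience idea described above, the Python sketch below interleaves a learned scheduler, which picks the auxiliary intention to execute next, with off-policy updates for every intention from one shared replay buffer. The environment API and the `policies`, `scheduler`, and `replay` objects are hypothetical placeholders, not the paper's code.

```python
# Hedged sketch of one SAC-X episode: scheduled execution of auxiliary intentions,
# with every intention learning off-policy from the shared replay buffer.

def sac_x_episode(env, policies, scheduler, replay, main_task, switch_every=100):
    state, done, t = env.reset(), False, 0
    executed, main_return = [], 0.0
    while not done:
        if t % switch_every == 0:
            task = scheduler.sample(executed)        # learned scheduler picks the next intention
            executed.append(task)
        action = policies[task].act(state)
        next_state, rewards, done = env.step(action)  # rewards: one sparse reward per task
        replay.add(state, action, rewards, next_state)
        main_return += rewards[main_task]
        state, t = next_state, t + 1
    for name, policy in policies.items():             # every intention learns from every transition,
        policy.update(replay.sample(), reward_key=name)  # regardless of which policy collected it
    scheduler.update(executed, main_return)            # scheduler learns which sequences help the main task
```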

Wed 11 July 7:40 - 7:50 PDT

Automatic Goal Generation for Reinforcement Learning Agents

Carlos Florensa · David Held · Xinyang Geng · Pieter Abbeel

Reinforcement learning (RL) is a powerful technique to train an agent to perform a task; however, an agent that is trained using RL is only capable of achieving the single task that is specified via its reward function. Such an approach does not scale well to settings in which an agent needs to perform a diverse set of tasks, such as navigating to varying positions in a room or moving objects to varying locations. Instead, we propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing in its environment. We use a generator network to propose tasks for the agent to try to accomplish, each task being specified as reaching a certain parametrized subset of the state-space. The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent, thus automatically producing a curriculum. We show that, by using this framework, an agent can efficiently and automatically learn to perform a wide set of tasks without requiring any prior knowledge of its environment, even when only sparse rewards are available. Videos and code available at https://sites.google.com/view/goalgeneration4rl.
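
The generator-as-curriculum idea in the abstract can be sketched as follows: sample candidate goals, label each by whether the current agent succeeds at it at an intermediate rate, and train the generator adversarially toward goals with that label. All names (`generator`, `discriminator`, `agent`, `attempt`) and the thresholds `r_min`/`r_max` are assumptions for illustration, not the released code.

```python
# Hedged sketch of one iteration of the automatic goal-generation loop.
# attempt(agent, goal) is assumed to run one episode and return 1 on success, 0 otherwise.

def goal_generation_iteration(generator, discriminator, agent, attempt,
                              n_goals=64, n_trials=5, r_min=0.1, r_max=0.9):
    goals = generator.sample(n_goals)                  # propose candidate goals (state-space regions)
    labels = []
    for g in goals:
        success_rate = sum(attempt(agent, g) for _ in range(n_trials)) / n_trials
        # A goal is labeled "good" if it is of intermediate difficulty for the current agent.
        labels.append(r_min <= success_rate <= r_max)
    agent.update()                                     # policy improves on the goals it just attempted
    # Adversarial training pushes the generator toward intermediate-difficulty goals,
    # yielding an automatic curriculum as the agent improves.
    discriminator.train(goals, labels)
    generator.train(discriminator)
```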

Wed 11 July 7:50 - 8:00 PDT

Universal Planning Networks: Learning Generalizable Representations for Visuomotor Control

Aravind Srinivas · Allan Jabri · Pieter Abbeel · Sergey Levine · Chelsea Finn

A key challenge in complex visuomotor control is learning abstract representations that are effective for specifying goals, planning, and generalization. To this end, we introduce universal planning networks (UPN). UPNs embed differentiable planning within a goal-directed policy. This planning computation unrolls a forward model in a latent space and infers an optimal action plan through gradient descent trajectory optimization. The plan-by-gradient-descent process and its underlying representations are learned end-to-end to directly optimize a supervised imitation learning objective. We find that the representations learned are not only effective for goal-directed visual imitation via gradient-based trajectory optimization, but can also provide a metric for specifying goals using images. The learned representations can be leveraged to specify distance-based rewards to reach new target states for model-free reinforcement learning, resulting in substantially more effective learning when solving new tasks described via image-based goals. We were able to achieve successful transfer of visuomotor planning strategies across robots with significantly different morphologies and actuation capabilities. Visit https://sites.google.com/view/upn-public/home for video highlights.
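
The plan-by-gradient-descent computation can be sketched in a few lines; the version below uses PyTorch for the inner trajectory optimization and treats `encoder` and `dynamics` as already-defined modules. It is a simplified sketch (the latents are detached here, whereas UPN differentiates the outer imitation loss through the entire planner end-to-end), not the authors' implementation.

```python
import torch

def plan_by_gradient_descent(encoder, dynamics, obs, goal_obs,
                             horizon=10, action_dim=2, n_updates=20, lr=0.1):
    # Encode current and goal observations into the learned latent space.
    # (Detached for simplicity; in UPN the imitation loss backpropagates
    # through the planner, the dynamics model, and the encoder.)
    z0 = encoder(obs).detach()
    z_goal = encoder(goal_obs).detach()
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.SGD([actions], lr=lr)
    for _ in range(n_updates):
        opt.zero_grad()
        z = z0
        for t in range(horizon):
            z = dynamics(z, actions[t])          # unroll the latent forward model
        loss = ((z - z_goal) ** 2).sum()         # latent distance to the goal is the plan cost
        loss.backward()                          # gradient-descent trajectory optimization
        opt.step()
    return actions.detach()
```

The same latent distance, once learned, is what the abstract proposes reusing as a reward signal for model-free RL on new, image-specified goals.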