## Reinforcement Learning and Planning 2

Moderator: Yutian Chen

Tue 20 Jul 7 a.m. PDT — 8 a.m. PDT

Abstract:

### Chat is not available.

Tue 20 July 7:00 - 7:20 PDT

(Oral)
##### Skill Discovery for Exploration and Planning using Deep Skill Graphs

Akhil Bagaria · Jason Senthil · George Konidaris

We introduce a new skill-discovery algorithm that builds a discrete graph representation of large continuous MDPs, where nodes correspond to skill subgoals and the edges to skill policies. The agent constructs this graph during an unsupervised training phase where it interleaves discovering skills and planning using them to gain coverage over ever-increasing portions of the state-space. Given a novel goal at test time, the agent plans with the acquired skill graph to reach a nearby state, then switches to learning to reach the goal. We show that the resulting algorithm, Deep Skill Graphs, outperforms both flat and existing hierarchical reinforcement learning methods on four difficult continuous control tasks.

[ Paper ]
[ ]
Tue 20 July 7:20 - 7:25 PDT

(Spotlight)
##### Learning Routines for Effective Off-Policy Reinforcement Learning

Edoardo Cetin · Oya Celiktutan

The performance of reinforcement learning depends upon designing an appropriate action space, where the effect of each action is measurable, yet, granular enough to permit flexible behavior. So far, this process involved non-trivial user choices in terms of the available actions and their execution frequency. We propose a novel framework for reinforcement learning that effectively lifts such constraints. Within our framework, agents learn effective behavior over a routine space: a new, higher-level action space, where each routine represents a set of 'equivalent' sequences of granular actions with arbitrary length. Our routine space is learned end-to-end to facilitate the accomplishment of underlying off-policy reinforcement learning objectives. We apply our framework to two state-of-the-art off-policy algorithms and show that the resulting agents obtain relevant performance improvements while requiring fewer interactions with the environment per episode, improving computational efficiency.

[ Paper ]
[ ]
Tue 20 July 7:25 - 7:30 PDT

(Spotlight)
##### PODS: Policy Optimization via Differentiable Simulation

Miguel Angel Zamora Mora · Momchil Peychev · Sehoon Ha · Martin Vechev · Stelian Coros

Current reinforcement learning (RL) methods use simulation models as simple black-box oracles. In this paper, with the goal of improving the performance exhibited by RL algorithms, we explore a systematic way of leveraging the additional information provided by an emerging class of differentiable simulators. Building on concepts established by Deterministic Policy Gradients (DPG) methods, the neural network policies learned with our approach represent deterministic actions. In a departure from standard methodologies, however, learning these policies does not hinge on approximations of the value function that must be learned concurrently in an actor-critic fashion. Instead, we exploit differentiable simulators to directly compute the analytic gradient of a policy's value function with respect to the actions it outputs. This, in turn, allows us to efficiently perform locally optimal policy improvement iterations. Compared against other state-of-the-art RL methods, we show that with minimal hyper-parameter tuning our approach consistently leads to better asymptotic behavior across a set of payload manipulation tasks that demand a high degree of accuracy and precision.

[ Paper ]
[ ]
Tue 20 July 7:30 - 7:35 PDT

(Spotlight)
##### Learning and Planning in Complex Action Spaces

Thomas Hubert · Julian Schrittwieser · Ioannis Antonoglou · Mohammadamin Barekatain · Simon Schmitt · David Silver

Many important real-world problems have action spaces that are high-dimensional, continuous or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.

[ Paper ]
[ ]
Tue 20 July 7:35 - 7:40 PDT

(Spotlight)
##### Model-Based Reinforcement Learning via Latent-Space Collocation

Oleh Rybkin · Chuning Zhu · Anusha Nagabandi · Kostas Daniilidis · Igor Mordatch · Sergey Levine

The ability to plan into the future while utilizing only raw high-dimensional observations, such as images, can provide autonomous agents with broad and general capabilities. However, realistic tasks require performing temporally extended reasoning, and cannot be solved with only myopic, short-sighted planning. Recent work in model-based reinforcement learning (RL) has shown impressive results on tasks that require only short-horizon reasoning. In this work, we study how the long-horizon planning abilities can be improved with an algorithm that optimizes over sequences of states, rather than actions, which allows better credit assignment. To achieve this, we draw on the idea of collocation and adapt it to the image-based setting by leveraging probabilistic latent variable models, resulting in an algorithm that optimizes trajectories over latent variables. Our latent collocation method (LatCo) provides a general and effective visual planning approach, and significantly outperforms prior model-based approaches on challenging visual control tasks with sparse rewards and long-term goals. See the videos on the supplementary website \url{https://sites.google.com/view/latco-mbrl/.}

[ Paper ]
[ ]
Tue 20 July 7:40 - 7:45 PDT

(Spotlight)
##### Vector Quantized Models for Planning

Sherjil Ozair · Yazhe Li · Ali Razavi · Ioannis Antonoglou · Aäron van den Oord · Oriol Vinyals

Recent developments in the field of model-based RL have proven successful in a range of environments, especially ones where planning is essential. However, such successes have been limited to deterministic fully-observed environments. We present a new approach that handles stochastic and partially-observable environments. Our key insight is to use discrete autoencoders to capture the multiple possible effects of an action in a stochastic environment. We use a stochastic variant of Monte Carlo tree search to plan over both the agent's actions and the discrete latent variables representing the environment's response. Our approach significantly outperforms an offline version of MuZero on a stochastic interpretation of chess where the opponent is considered part of the environment. We also show that our approach scales to DeepMind Lab, a first-person 3D environment with large visual observations and partial observability.

[ Paper ]
[ ]
Tue 20 July 7:45 - 7:50 PDT

(Spotlight)
##### LTL2Action: Generalizing LTL Instructions for Multi-Task RL

Pashootan Vaezipoor · Andrew C Li · Rodrigo A Toro Icarte · Sheila McIlraith

We address the problem of teaching a deep reinforcement learning (RL) agent to follow instructions in multi-task environments. Instructions are expressed in a well-known formal language – linear temporal logic (LTL) – and can specify a diversity of complex, temporally extended behaviours, including conditionals and alternative realizations. Our proposed learning approach exploits the compositional syntax and the semantics of LTL, enabling our RL agent to learn task-conditioned policies that generalize to new instructions, not observed during training. To reduce the overhead of learning LTL semantics, we introduce an environment-agnostic LTL pretraining scheme which improves sample-efficiency in downstream environments. Experiments on discrete and continuous domains target combinatorial task sets of up to $\sim10^{39}$ unique tasks and demonstrate the strength of our approach in learning to solve (unseen) tasks, given LTL instructions.

[ Paper ]
[ ]
Tue 20 July 7:50 - 7:55 PDT

(Q&A)

[ ]