Workshop

ICML 2021 Workshop on Unsupervised Reinforcement Learning

Feryal Behbahani, Joelle Pineau, Lerrel Pinto, Roberta Raileanu, Aravind Srinivas, Denis Yarats, Amy Zhang

Abstract:

Unsupervised learning has begun to deliver on its promise, with tremendous progress in natural language processing and computer vision: large-scale unsupervised pre-training enables fine-tuning on downstream supervised learning tasks with limited labeled data. This is particularly encouraging and appealing in the context of reinforcement learning, where performing rollouts in the real world annotated with reward signals or human demonstrations is expensive. We therefore believe that a workshop at the intersection of unsupervised and reinforcement learning is timely, and we hope to bring together researchers with diverse views on how to make further progress in this exciting and open-ended subfield.


Schedule

Fri 5:45 a.m. - 6:00 a.m.
Opening remarks
Fri 6:00 a.m. - 6:30 a.m.
Invited Talk by David Ha (Invited talk)
David Ha
Fri 6:30 a.m. - 7:00 a.m.
Invited Talk by Alessandro Lazaric (Invited talk)   
Alessandro Lazaric
Fri 7:00 a.m. - 7:30 a.m.
Invited Talk by Kelsey Allen (Invited talk)   
Kelsey Allen
Fri 7:30 a.m. - 8:30 a.m.
Coffee break and Poster Session (Poster Session)
Fri 8:30 a.m. - 9:00 a.m.
Invited Talk by Danijar Hafner (Invited talk)   
Danijar Hafner
Fri 9:00 a.m. - 9:30 a.m.
Invited Talk by Nan Rosemary Ke (Invited talk)   
Rosemary Nan Ke
Fri 9:30 a.m. - 10:30 a.m.
Lunch and Poster Session (Poster session)
Fri 10:30 a.m. - 10:50 a.m.
Oral Presentation: Discovering and Achieving Goals with World Models (Oral presentation)   
Oleg Rybkin, Deepak Pathak
Fri 10:50 a.m. - 11:10 a.m.
Oral Presentation: Planning from Pixels in Environments with Combinatorially Hard Search Spaces (Oral Presentation)   
Georg Martius, Marco Bagatella
Fri 11:10 a.m. - 11:30 a.m.
Oral Presentation: Learning Task Agnostic Skills with Data-driven Guidance (Oral Presentation)   
Even Klemsdal, Abdulmajid Murad
Fri 11:30 a.m. - 12:00 p.m.
Invited Talk by Kianté Brantley (Invited talk)   
Kianté Brantley
Fri 12:00 p.m. - 1:00 p.m.
Coffee break and Poster Session (Poster session)
Fri 1:00 p.m. - 1:30 p.m.
Invited Talk by Chelsea Finn (Invited talk)   
Chelsea Finn
Fri 1:30 p.m. - 2:00 p.m.
Invited Talk by Pieter Abbeel (Invited talk)   
Pieter Abbeel
Fri 2:00 p.m. - 2:30 p.m.
Panel Discussion   
Rosemary Nan Ke, Danijar Hafner, Pieter Abbeel, Chelsea Finn
-
[ Visit Poster at Spot C3 in Virtual World ]
In reinforcement learning, we encode the potential behaviors of an agent interacting with an environment into an infinite set of policies, called policy space, typically represented by a family of parametric functions. Dealing with such a policy space is a hefty challenge, which often causes sample and computational inefficiencies. However, we argue that a limited number of policies is actually relevant when we also account for the structure of the environment and of the policy parameterization, as many of them would induce very similar interactions, i.e., state-action distributions. In this paper, we seek a reward-free compression of the policy space into a finite set of representative policies, such that, given any policy $\pi$, the minimum Rényi divergence between the state-action distributions of the representative policies and the state-action distribution of $\pi$ is bounded. We show that this compression of the policy space can be formulated as a set cover problem, and it is inherently NP-hard. Nonetheless, we propose a game-theoretic reformulation for which a locally optimal solution can be efficiently found by iteratively stretching the compressed space to cover the most challenging policy. Finally, we provide an empirical evaluation to illustrate the compression procedure in simple domains, and its ripple effects in reinforcement learning.
Mirco Mutti, Stefano Del Col, Marcello Restelli
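The set-cover view of policy-space compression can be illustrated with a toy greedy sketch. The Rényi divergence of order 2 comes from the abstract, but the finite candidate set, the greedy covering strategy, and all names below are illustrative simplifications, not the paper's algorithm:

```python
# Toy sketch: greedily compress a finite set of policies, each summarized
# by its discrete state-action distribution, into a few representatives
# such that every policy is within `eps` (Renyi-2) of some representative.
import math

def renyi2(p, q):
    """Renyi divergence of order 2 between two discrete distributions."""
    return math.log(sum(pi * pi / qi for pi, qi in zip(p, q) if pi > 0))

def compress(policies, eps):
    """Greedy set cover: return indices of representative policies."""
    uncovered = set(range(len(policies)))
    reps = []
    while uncovered:
        best, best_cov = None, set()
        for i in range(len(policies)):
            # policies whose distributions candidate i would cover
            cov = {j for j in uncovered
                   if renyi2(policies[j], policies[i]) <= eps}
            if len(cov) > len(best_cov):
                best, best_cov = i, cov
        reps.append(best)
        uncovered -= best_cov
    return reps

# three state-action distributions; the first two are nearly identical,
# so two representatives suffice
pols = [[0.5, 0.3, 0.2], [0.49, 0.31, 0.2], [0.1, 0.1, 0.8]]
reps = compress(pols, eps=0.05)
```

The greedy heuristic is the classical approximation for set cover; the paper instead proposes a game-theoretic reformulation for the same NP-hard problem.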
-
[ Visit Poster at Spot C2 in Virtual World ]

Several recent works have been dedicated to the pure exploration of a single reward-free environment. Along this line, we address the problem of learning to explore a class of multiple reward-free environments with a unique general strategy, which aims to provide a universal initialization to subsequent reinforcement learning problems specified over the same class. Notably, the problem is inherently multi-objective as we can trade off the exploration performance between environments in many ways. In this work, we foster an exploration strategy that is sensitive to the most adverse cases within the class. Hence, we cast the exploration problem as the maximization of the mean of a critical percentile of the state visitation entropy induced by the exploration strategy over the class of environments. Then, we present a policy gradient algorithm, MEMENTO, to optimize the introduced objective through mediated interactions with the class. Finally, we empirically demonstrate the ability of the algorithm in learning to explore challenging classes of continuous environments and we show that reinforcement learning greatly benefits from the pre-trained exploration strategy when compared to learning from scratch.

Mirco Mutti, Mattia Mancassola, Marcello Restelli
-
[ Visit Poster at Spot C1 in Virtual World ]

In the maximum state entropy exploration framework, an agent interacts with a reward-free environment to learn a policy that maximizes the entropy of the expected state visitations it is inducing. Hazan et al. (2019) noted that the class of Markovian stochastic policies is sufficient for the maximum state entropy objective, and exploiting non-Markovianity is generally considered pointless in this setting. In this paper, we argue that non-Markovianity is instead paramount for maximum state entropy exploration in a finite-sample regime. In particular, we recast the objective to target the expected entropy of the induced state visitations in a single trial. Then, we show that the class of non-Markovian deterministic policies is sufficient for the introduced objective, while Markovian policies suffer non-zero regret in general. However, we prove that the problem of finding an optimal non-Markovian policy is NP-hard. Despite this negative result, we discuss avenues to address the problem in a tractable way and how non-Markovian exploration could benefit the sample efficiency of online reinforcement learning in future works.

Mirco Mutti, Riccardo De Santi, Marcello Restelli
-
[ Visit Poster at Spot C2 in Virtual World ]

Learning meaningful behaviors in the absence of a task-specific reward function is a challenging problem in reinforcement learning. A desirable unsupervised objective is to learn a set of diverse skills that provide a thorough coverage of the state space while being directed, i.e., reliably reaching distinct regions of the environment. At test time, an agent could then leverage these skills to solve sparse reward problems by performing efficient exploration and finding an effective goal-directed policy with little-to-no additional learning. Unfortunately, it is challenging to learn skills with such properties, as diffusing (e.g., stochastic policies performing good coverage) skills are not reliable in targeting specific states, whereas directed (e.g., goal-based policies) skills provide limited coverage. In this paper, inspired by the mutual information framework, we propose a novel algorithm designed to maximize coverage while ensuring a constraint on the directedness of each skill. In particular, we design skills with a decoupled policy structure, with a first part trained to be directed and a second diffusing part that ensures local coverage. Furthermore, we leverage the directedness constraint to adaptively add or remove skills as well as incrementally compose them along a tree that is grown to achieve a thorough coverage of the environment. We illustrate how our learned skills enable agents to efficiently solve sparse-reward downstream tasks in navigation environments, comparing favorably with existing baselines.

Pierre-Alexandre Kamienny, Jean Tarbouriech, Alessandro Lazaric, Ludovic Denoyer
-
[ Visit Poster at Spot C0 in Virtual World ]

The shortcomings of maximum likelihood estimation in the context of model-based reinforcement learning have been highlighted by an increasing number of papers. When the model class is misspecified or has a limited representational capacity, model parameters with high likelihood might not necessarily result in high performance of the agent on a downstream control task. To alleviate this problem, we propose an end-to-end approach for model learning which directly optimizes the expected returns using implicit differentiation. We treat a value function that satisfies the Bellman optimality operator induced by the model as an implicit function of model parameters and show how to differentiate the function. We provide theoretical and empirical evidence highlighting the benefits of our approach in the model misspecification regime compared to likelihood-based methods.

Evgenii Nikishin, Romina Abachi, Rishabh Agarwal, Pierre-Luc Bacon
-

The ability to form complex plans based on raw visual input is a litmus test for current capabilities of artificial intelligence, as it requires a seamless combination of visual processing and abstract algorithmic execution, two traditionally separate areas of computer science. A recent surge of interest in this field brought advances that yield good performance in tasks ranging from arcade games to continuous control; these methods however do not come without significant issues, such as limited generalization capabilities and difficulties when dealing with combinatorially hard planning instances. Our contribution is two-fold: (i) we present a method that learns to represent its environment as a latent graph and leverages state reidentification to reduce the complexity of finding a good policy from exponential to linear; (ii) we introduce a set of lightweight environments with an underlying discrete combinatorial structure in which planning is challenging even for humans. Moreover, we show that our method achieves strong empirical generalization to variations in the environment, even across highly disadvantaged regimes, such as “one-shot” planning, or in an offline RL paradigm which only provides low-quality trajectories.

Marco Bagatella, Miroslav Olšák, Michal Rolinek, Georg Martius
-
[ Visit Poster at Spot C1 in Virtual World ]

Designing agents that acquire knowledge autonomously and use it to solve new tasks efficiently is an important challenge in reinforcement learning. Knowledge acquired during an unsupervised pre-training phase is often transferred by fine-tuning neural network weights once rewards are exposed, as is common practice in supervised domains. Given the nature of the reinforcement learning problem, we argue that standard fine-tuning strategies alone are not enough for efficient transfer in challenging domains. We introduce Behavior Transfer (BT), a technique that leverages pre-trained policies for exploration and that is complementary to transferring neural network weights. Our experiments show that, when combined with large-scale pre-training in the absence of rewards, existing intrinsic motivation objectives can lead to the emergence of complex behaviors. These pre-trained policies can then be leveraged by BT to discover better solutions than without pre-training, and combining BT with standard fine-tuning strategies results in additional benefits. The largest gains are generally observed in domains requiring structured exploration, including settings where the behavior of the pre-trained policies is misaligned with the downstream task.

Víctor Campos, Pablo Sprechmann, Steven Hansen, Andre Barreto, Steven Kapturowski, Alex Vitvitskyi, Adrià Puigdomenech Badia, Charles Blundell
-
[ Visit Poster at Spot C0 in Virtual World ]

Data-efficiency and generalization are key challenges in deep learning and deep reinforcement learning as many models are trained on large-scale, domain-specific, and expensive-to-label datasets. Self-supervised models trained on large-scale uncurated datasets have shown successful transfer to diverse settings. We investigate using pretrained image representations and spatio-temporal attention for state representation learning in Atari. We also explore fine-tuning pretrained representations with self-supervised techniques, i.e., contrastive predictive coding, spatio-temporal contrastive learning, and augmentations. Our results show that pretrained representations are on par with state-of-the-art self-supervised methods trained on domain-specific data. Pretrained representations, thus, yield data- and compute-efficient state representations.

Mina Khan, Advait Rane, Srivatsa P, Shriram Chenniappa, Rishabh Anand, Sherjil Ozair, Patricia Maes
-
[ Visit Poster at Spot B6 in Virtual World ]

In this paper, we study the problem of representation learning and exploration in reinforcement learning. We propose a framework to compute exploration bonuses based on density estimation, that can be used with any representation learning method, and that allows the agent to explore without extrinsic rewards. In the special case of tabular Markov decision processes (MDPs), this approach mimics the behavior of theoretically sound algorithms. In continuous and partially observable MDPs, the same approach can be applied by learning a latent representation, on which a probability density is estimated.

Omar Darwiche Domingues, Corentin Tallec, Remi Munos, Michal Valko
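In the tabular case mentioned above, a density-based exploration bonus reduces to counting: the intrinsic reward shrinks as a state's estimated visitation density grows. The 1/√count form and all names below are illustrative assumptions, not the paper's estimator:

```python
# Minimal sketch of a count/density-based exploration bonus for tabular
# MDPs. Novel states yield large intrinsic rewards; frequently visited
# states yield small ones, mimicking 1/sqrt(N) bonuses from the theory.
import math
from collections import Counter

class DensityBonus:
    def __init__(self):
        self.counts = Counter()

    def bonus(self, state):
        """Record a visit to `state` and return a reward ~ 1/sqrt(count)."""
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])

explorer = DensityBonus()
first = explorer.bonus("s0")                           # novel -> bonus 1.0
later = [explorer.bonus("s0") for _ in range(8)][-1]   # familiar -> 1/3
```

In continuous or partially observable settings, the count table would be replaced by a density model fit on a learned latent representation, as the abstract describes.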
-

To increase autonomy in reinforcement learning, agents need to learn useful behaviours without reliance on manually designed reward functions. To that end, skill discovery methods have been used to learn the intrinsic options available to an agent using task-agnostic objectives. However, without the guidance of task-specific rewards, emergent behaviours are generally useless due to the under-constrained problem of skill discovery in complex and high-dimensional spaces. This paper proposes a framework for guiding the skill discovery towards the subset of expert-visited states using a learned state projection. We apply our method in various reinforcement learning (RL) tasks and show that such a projection results in more useful behaviours.

Even Klemsdal, Sverre Herland, Abdulmajid Murad
-
[ Visit Poster at Spot B6 in Virtual World ]

While agents trained by Reinforcement Learning (RL) can solve increasingly challenging tasks directly from visual observations, generalizing learned skills to novel environments remains very challenging. Extensive use of data augmentation is a promising technique for improving generalization in RL, but it is often found to decrease sample efficiency and can even lead to divergence. In this paper, we investigate causes of instability when using data augmentation in common off-policy RL algorithms. We identify two problems, both rooted in high-variance Q-targets. Based on our findings, we propose a simple yet effective technique for stabilizing this class of algorithms under augmentation. We perform extensive empirical evaluation of image-based RL using both ConvNets and Vision Transformers (ViT) on a family of benchmarks based on DeepMind Control Suite, as well as in robotic manipulation tasks. Our method greatly improves stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL. We further show that our method scales to RL with ViT-based architectures, and that data augmentation may be especially important in this setting. Code and videos: https://nicklashansen.github.io/SVEA

Nicklas Hansen, Hao Su, Xiaolong Wang
-
[ Visit Poster at Spot B5 in Virtual World ]

A major challenge in reinforcement learning is the design of agents that are able to generalize across tasks that share common dynamics. A viable solution is meta-reinforcement learning, which identifies common structures among past tasks that can then be generalized to new tasks (meta-test). Prior works learn meta-representations jointly while solving tasks, resulting in representations that do not generalize well across policies and lead to sample inefficiency during the meta-test phase. In this work, we introduce state2vec, an efficient and low-complexity unsupervised framework for learning disentangled representations that are more general. The state embedding vectors learned with state2vec capture the geometry of the underlying state space, resulting in high-quality basis functions for linear value function approximation.

Sephora Madjiheurem, Laura Toni
-
[ Visit Poster at Spot B4 in Virtual World ]

Intrinsic rewards are commonly applied to improve exploration in reinforcement learning. However, these approaches suffer from non-stationary reward shaping and strong dependency on hyperparameters. In this work, we propose Decoupled RL (DeRL) which trains separate policies for exploration and exploitation. DeRL can be applied with on-policy and off-policy RL algorithms. We evaluate DeRL algorithms in two exploration-focused environments with five types of intrinsic rewards. We show that DeRL can be more robust to scaling of intrinsic rewards and converge to the same evaluation returns as intrinsically motivated baselines in fewer interactions.

Lukas Schäfer, Filippos Christianos, Josiah Hanna, Stefano V. Albrecht
-
[ Visit Poster at Spot B5 in Virtual World ]

Biasing dynamical simulations along collective variables (CVs) uncovered by unsupervised learning has become a standard approach in the analysis of molecular systems. However, despite parallels with reinforcement learning (RL), state-of-the-art RL methods have yet to reach the molecular dynamics community. The interaction between unsupervised learning, dynamical simulations, and RL is therefore a promising area of research. We introduce a method for enhanced sampling that uses nonlinear geometry estimated by an unsupervised learning algorithm in a reinforcement-learning-enhanced sampler. We give theoretical background justifying this method, and show empirical results.

James Buenfil, Samson Koelle, Marina Meila
-
[ Visit Poster at Spot B4 in Virtual World ]

Learning data representations that are useful for various downstream tasks is a cornerstone of artificial intelligence. While existing methods are typically evaluated on downstream tasks such as classification or generative image quality, we propose to assess representations through their usefulness in downstream control tasks, such as reaching or pushing objects. By training over 10,000 reinforcement learning policies, we extensively evaluate to what extent different representation properties affect out-of-distribution (OOD) generalization. Finally, we demonstrate zero-shot transfer of these policies from simulation to the real world, without any domain randomization or fine-tuning. This paper aims to establish the first systematic characterization of the usefulness of learned representations for real-world OOD downstream tasks.

Frederik Träuble, Andrea Dittadi, Manuel Wuthrich, Felix Widmaier, Peter V Gehler, Ole Winther, Francesco Locatello, Olivier Bachem, Bernhard Schölkopf, Stefan Bauer
-
[ Visit Poster at Spot B3 in Virtual World ]

Imitation learning learns how to act by observing the behavior of an expert demonstrator. We are concerned with a setting where the demonstrations comprise only a subset of state-action pairs (as opposed to whole trajectories). Our setup reflects the limitations of real-world problems when accessing expert data. For example, user logs may contain incomplete traces of behavior, or, in robotics, non-technical human demonstrators may describe trajectories using only a subset of all state-action pairs. A recent approach to imitation learning via distribution matching, ValueDice, tends to overfit when demonstrations are temporally sparse. We counter this overfitting by introducing regularization losses. Our empirical evaluation with MuJoCo benchmarks shows that we can successfully learn from very sparse and scarce expert data. Moreover, (i) the quality of the learned policies is often comparable to those learned with full expert trajectories, and (ii) the number of training steps required to learn from sparse data is similar to the number of training steps when the agent has access to full expert trajectories.

Alberto Camacho, Izzeddin Gur, Marcin Moczulski, Ofir Nachum, Aleksandra Faust
-
[ Visit Poster at Spot B2 in Virtual World ]

Many reinforcement learning (RL) agents require a large amount of experience to solve tasks. We propose Contrastive BERT for RL (CoBERL), an agent that combines a new contrastive loss and a hybrid LSTM-transformer architecture to tackle the challenge of improving data efficiency. CoBERL enables efficient, robust learning from pixels across a wide range of domains. We use bidirectional masked prediction in combination with a generalization of recent contrastive methods to learn better representations for transformers in RL, without the need of hand engineered data augmentations. We find that CoBERL consistently improves performance across the full Atari suite, a set of control tasks and a challenging 3D environment.

Andrea Banino, Adrià Puigdomenech Badia, Jacob C Walker, Tim Scholtes, Jovana Mitrovic, Charles Blundell
-
[ Visit Poster at Spot B1 in Virtual World ]

Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges including representation learning for visual observations, sample complexity due to high dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results including videos can be found online at \url{https://sites.google.com/view/variational-mail}.

Rafael Rafailov, Tianhe (Kevin) Yu, Aravind Rajeswaran, Chelsea Finn
-

How can an artificial agent learn to solve a wide range of tasks in a complex visual environment in the absence of external supervision? We decompose this question into two problems, global exploration of the environment and learning to reliably reach situations found during exploration. We introduce the Explore Achieve Network (ExaNet), a unified solution to these by learning a world model from the high-dimensional images and using it to train an explorer and an achiever policy from imagined trajectories. Unlike prior methods that explore by reaching previously visited states, our explorer plans to discover unseen surprising states through foresight, which are then used as diverse targets for the achiever. After the unsupervised phase, ExaNet solves tasks specified by goal images without any additional learning. We introduce a challenging benchmark spanning across four standard robotic manipulation and locomotion domains with a total of over 40 test tasks. Our agent substantially outperforms previous approaches to unsupervised goal reaching and achieves goals that require interacting with multiple objects in sequence. Finally, to demonstrate the scalability and generality of our approach, we train a single general agent across four distinct environments. For videos, see https://sites.google.com/view/exanet/home.

Russell Mendonca, Oleg Rybkin, Kostas Daniilidis, Danijar Hafner, Deepak Pathak
-
[ Visit Poster at Spot B0 in Virtual World ]

We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch
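The return conditioning at the heart of Decision Transformer can be sketched minimally: each timestep becomes a (return-to-go, state, action) triple fed to the sequence model, and at test time the initial return-to-go is set to the desired return. The tokenization below is a simplified illustration, not the paper's implementation:

```python
# Sketch of Decision Transformer-style conditioning: compute returns-to-go
# (suffix sums of rewards) and interleave them with states and actions.
def returns_to_go(rewards):
    """Suffix sums of the reward sequence."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

def to_tokens(states, actions, rewards):
    """Build (return-to-go, state, action) triples for autoregressive modeling."""
    rtg = returns_to_go(rewards)
    return [(g, s, a) for g, s, a in zip(rtg, states, actions)]

tokens = to_tokens(states=["s0", "s1"], actions=["a0", "a1"],
                   rewards=[1.0, 2.0])
```

A causally masked Transformer trained on such sequences can then be prompted with a high target return-to-go to generate return-conditioned actions.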
-
[ Visit Poster at Spot B3 in Virtual World ]

Planning in complex environments requires reasoning over multi-step timescales. However, in model-based learning, an agent’s model is more commonly defined over transitions between consecutive states. This leads to plans using intermediate states that are either unnecessary, or worse, introduce cumulative prediction errors. Inspired by recent work on human time perception, we devise a novel approach for learning a transition dynamics model based on the sequences of episodic memories that define an agent's subjective timescale – over which it learns world dynamics and over which future planning is performed. We analyse the emergent benefits of the subjective-timescale model (STM) by incorporating it into two disparate model-based methods – Dreamer and deep active inference. Using 3D visual foraging tasks, we demonstrate that STM can systematically vary the temporal extent of its predictions and is more likely to predict future salient events (such as new objects coming into view). In comparison to agents trained using objective timescales, STM agents also collect more rewards due to their ability to perform flexible planning and a more pronounced exploratory behaviour.

Alexey Zakharov, Matthew Crosby, Zafeirios Fountas
-
[ Visit Poster at Spot A6 in Virtual World ]

Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models, leveraging the Markov property to factorize the problem in time. However, we can also view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether powerful, high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide simple and effective solutions to the RL problem. To this end, we explore how RL can be reframed as ``one big sequence modeling'' problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards. Addressing RL as a sequence modeling problem significantly simplifies a range of design decisions: we no longer require separate behavior policy constraints, as is common in prior work on offline model-free RL, and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL. All of these roles are filled by the same Transformer sequence model. In our experiments, we demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL.

Michael Janner, Qiyang Li, Sergey Levine
-
[ Visit Poster at Spot B2 in Virtual World ]

Learning reward-agnostic representations is an emerging paradigm in reinforcement learning. These representations can be leveraged for several purposes ranging from reward shaping to skill discovery. Nevertheless, in order to learn such representations, existing methods often rely on assuming uniform access to the state space. Without such a privilege, the agent’s coverage of the environment can be limited, which hurts the quality of the learned representations. In this work, we introduce a method that explicitly couples representation learning with exploration when the agent is not provided with a uniform prior over the state space. Our method learns representations that constantly drive exploration while the data generated by the agent’s exploratory behavior drives the learning of better representations. We empirically validate our approach in goal-achieving tasks, demonstrating that the learned representation captures the dynamics of the environment, leads to more accurate value estimation, and to faster credit assignment, both when used for control and for reward shaping. Finally, the exploratory policy that emerges from our approach proves to be successful at continuous navigation tasks with sparse rewards.

Akram Erraqabi, Harry Zhao, Marlos C. Machado, Yoshua Bengio, Sainbayar Sukhbaatar, Ludovic Denoyer, Alessandro Lazaric
-
[ Visit Poster at Spot A5 in Virtual World ]

The real world is large and complex. It is filled with many objects besides those defined by a task and objects can move with their own interesting dynamics. How should an agent learn to represent state to support efficient learning and generalization in such an environment? In this work, we present a novel memory architecture, Perceptual Schemata, for learning and zero-shot generalization in environments that have many, potentially moving objects. Perceptual Schemata represents state using a combination of schema modules that each learn to attend to and maintain stateful representations of different subspaces of a spatio-temporal tensor describing the agent’s observations. We present empirical results that Perceptual Schemata enables a state representation that can maintain multiple objects observed in sequence with independent dynamics while an LSTM cannot. We additionally show that Perceptual Schemata can generalize more gracefully to larger environments with more distractor objects, while an LSTM quickly overfits to the training tasks.

Wilka T Carvalho, Murray Shanahan
-
[ Visit Poster at Spot B1 in Virtual World ]

Most reinforcement learning (RL) algorithms rely on hand-crafted extrinsic rewards to learn skills. However, crafting a reward function for each skill is not scalable and results in narrow agents that learn reward-specific skills. To alleviate the reliance on reward engineering it is important to develop RL algorithms capable of efficiently acquiring skills with no rewards extrinsic to the agent. While much progress has been made on reward-free exploration in RL, current methods struggle to explore efficiently. Self-play has long been a promising approach for acquiring skills but most successful applications have been in multi-agent zero-sum games with extrinsic reward. In this work, we present SelfPlayer, a data-efficient single-agent self-play exploration algorithm. SelfPlayer samples hard but achievable goals from the agent’s past by maximizing a symmetric KL divergence between the visitation distributions of two copies of the agent, Alice and Bob. We show that SelfPlayer outperforms prior leading self-supervised exploration algorithms such as Go-Explore and Curiosity on the data-efficient Atari benchmark.

Michael Laskin, Catherine Cang, Ryan Rudes, Pieter Abbeel
[ Visit Poster at Spot B0 in Virtual World ]
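The symmetric KL objective above has a direct discrete form. A minimal sketch over empirical state-visitation counts (the toy counts and the smoothing constant are assumptions for illustration; the paper's estimator operates on learned visitation distributions):

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-8):
    """D_KL(p||q) + D_KL(q||p) between two discrete visitation distributions."""
    p = (p + eps) / (p + eps).sum()   # smooth and normalize counts
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# empirical visitation counts over 5 states for the two agent copies
alice_visits = np.array([10., 5., 1., 0., 0.])
bob_visits   = np.array([2., 2., 2., 2., 2.])

gap  = symmetric_kl(alice_visits, bob_visits)   # large when the copies diverge
same = symmetric_kl(bob_visits, bob_visits)     # zero for identical distributions
```

Maximizing this divergence pushes Alice toward states Bob does not yet reach (and vice versa), which is what makes the sampled goals hard but achievable.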

Humans and animals explore their environment and acquire useful skills even in the absence of clear goals, exhibiting intrinsic motivation. The study of intrinsic motivation in artificial agents is concerned with the following question: what is a good general-purpose objective for an agent? We study this question in dynamic partially-observed environments, and argue that a compact and general learning objective is to minimize the entropy of the agent's state visitation estimated using a latent state-space model. This objective induces an agent to both gather information about its environment, corresponding to reducing uncertainty, and to gain control over its environment, corresponding to reducing the unpredictability of future world states. We instantiate this approach as a deep reinforcement learning agent equipped with a deep variational Bayes filter. We find that our agent learns to discover, represent, and exercise control of dynamic objects in a variety of partially-observed environments sensed with visual observations without extrinsic reward.

Nicholas Rhinehart, Jenny Wang, Glen Berseth, JD Co-Reyes, Danijar Hafner, Chelsea Finn, Sergey Levine
[ Visit Poster at Spot A6 in Virtual World ]
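The objective above, minimizing the entropy of the agent's state visitation, can be illustrated with a count-based stand-in on discrete states (the paper estimates this with a latent state-space model; the history and reward form below are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def visitation_entropy(states):
    """Monte-Carlo estimate of the entropy H(s) of a discrete visitation distribution."""
    counts = Counter(states)
    n = len(states)
    probs = np.array([c / n for c in counts.values()])
    return float(-(probs * np.log(probs)).sum())

def surprise_minimizing_reward(state, states):
    """Intrinsic reward log p̂(s): higher for familiar, predictable states."""
    counts = Counter(states)
    return float(np.log(counts[state] / len(states)))

history = [0, 0, 0, 1, 0, 0, 2, 0]
h = visitation_entropy(history)
familiar_r = surprise_minimizing_reward(0, history)   # frequently visited state
rare_r     = surprise_minimizing_reward(2, history)   # rarely visited state
```

Rewarding log-likelihood under the agent's own visitation model is what couples information gathering (a better model) with control (steering the world into predictable states).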

Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while ensuring that they are near-optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) where the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze how recently proposed robustness and discrimination rewards perform and find that they are sensitive to the initialization of the procedure and may converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we propose is important for discovering distinct behaviors, such as different locomotion patterns.

Tom Zahavy, Brendan O'Donoghue, Andre Barreto, Sebastian Flennerhag, Vlad Mnih, Satinder Singh
[ Visit Poster at Spot A5 in Virtual World ]
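The explicit diversity reward above, minimizing correlation between the Successor Features of the policies in the set, can be sketched as a negative mean pairwise cosine similarity (the 2-D toy vectors and this exact scoring function are illustrative assumptions, not the paper's formulation):

```python
import numpy as np

def sf_correlation_diversity(psis):
    """Diversity score for a set of successor-feature vectors:
    negative mean pairwise cosine similarity (higher = more diverse set)."""
    psis = np.asarray(psis, dtype=float)
    unit = psis / np.linalg.norm(psis, axis=1, keepdims=True)
    sim = unit @ unit.T                       # pairwise cosine similarities
    n = len(psis)
    off_diag = sim[~np.eye(n, dtype=bool)]    # ignore self-similarity
    return float(-off_diag.mean())

aligned    = [[1., 0.], [1., 0.01]]   # nearly identical SFs → low diversity
orthogonal = [[1., 0.], [0., 1.]]     # decorrelated SFs → high diversity
```

Maximizing this score while constraining each policy's extrinsic value is exactly the CMDP trade-off the abstract describes.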

Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, suffices to capture many kinds of goals in the Markov Decision Process (MDP) formulation of the Reinforcement Learning (RL) problem. However, not all goals can be captured in this manner. Specifically, it is easy to see that convex MDPs, in which goals are expressed as convex functions of the stationary distribution, cannot, in general, be formulated in this manner. In this paper, we reformulate the convex MDP problem as a min-max game between the policy and cost (negative reward) players using Fenchel duality and propose a meta-algorithm for solving it. We show that the average of the policies produced by an RL agent that maximizes the non-stationary reward produced by the cost player converges to an optimal solution to the convex MDP. Finally, we show that the meta-algorithm unifies several disparate branches of reinforcement learning algorithms in the literature, such as apprenticeship learning, variational intrinsic control, constrained MDPs, and pure exploration, into a single framework.

Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, Satinder Singh
[ Visit Poster at Spot A4 in Virtual World ]
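The Fenchel-duality reformulation above can be written out explicitly. Writing $d_\pi$ for the stationary state-action distribution of policy $\pi$, $f$ for the convex objective, and $f^*$ for its convex conjugate (notation assumed here for illustration, matching standard usage rather than quoted from the paper):

```latex
\min_{\pi} f(d_\pi)
  \;=\; \min_{\pi} \max_{\lambda} \;\bigl[\, \langle \lambda, d_\pi \rangle - f^*(\lambda) \,\bigr]
```

For a fixed cost vector $\lambda$, the inner objective is linear in $d_\pi$, so the policy player faces a standard RL problem with (non-stationary) reward $r = -\lambda$; the cost player ascends in $\lambda$, and averaging the policy player's iterates yields the convergence result stated in the abstract.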

Pre-training Reinforcement Learning (RL) agents in a task-agnostic manner has shown promising results. However, previous works still struggle in learning and discovering meaningful skills in high-dimensional state-spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation by making use of variational and contrastive techniques. We demonstrate that both enable RL agents to learn a set of basic navigation skills by maximizing an information theoretic objective. We assess our method in Minecraft 3D maps with different complexities. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. To overcome these limitations, we explore alternative input observations such as the relative position of the agent along with the raw pixels.

Juan José Nieto, Roger Creus Castanyer, Xavier Giro-i-Nieto
[ Visit Poster at Spot A4 in Virtual World ]

We use contrastive learning to obtain task-relevant state representations from images for reinforcement learning in a real-world system. To test the quality of the representations, an agent is trained with reinforcement learning in the Neuro-Slot-Car environment (Kietzmann & Riedmiller, 2009; Lange et al., 2012). In our experiments, we restrict the distribution from which samples are drawn for comparison in the contrastive loss. Our results show that the choice of sampling distribution for negative samples is essential for allowing task-relevant features to be represented in the presence of more prevalent, but irrelevant, features. This adds to recent research on feature suppression and feature invariance in contrastive representation learning. With the training of the reinforcement learning agent, we present, to our knowledge, a first approach to using contrastive learning of state representations for control in a real-world environment, using only images from one static camera.

Flemming Brieger, Daniel A Braun, Sascha Lange
[ Visit Poster at Spot A3 in Virtual World ]
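The role of the negative-sampling distribution above can be seen in a minimal InfoNCE sketch: restricting negatives to samples similar to the anchor produces a harder (larger) loss, which is what forces subtler task-relevant features to be encoded. The embeddings, noise scales, and temperature below are toy assumptions, not the paper's setup.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor; the caller chooses the negative distribution."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                       # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

rng = np.random.default_rng(0)
anchor   = rng.standard_normal(8)
positive = anchor + 0.01 * rng.standard_normal(8)            # augmented view
hard_negs = [anchor + 0.5 * rng.standard_normal(8) for _ in range(4)]  # similar samples
easy_negs = [rng.standard_normal(8) for _ in range(4)]                 # unrelated samples

loss_hard = info_nce(anchor, positive, hard_negs)
loss_easy = info_nce(anchor, positive, easy_negs)
```

With easy negatives the dominant (but task-irrelevant) features already separate anchor from negatives, so the loss is nearly zero and nothing further is learned; hard negatives keep the loss informative.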

Exploration in the absence of a concrete task is a key characteristic of autonomous agents and vital for the emergence of intelligent behaviour. Various intrinsic motivation frameworks have been suggested, such as novelty seeking, surprise maximisation or empowerment. Here we focus on the latter, empowerment, an agent-centric and information-theoretic measure of an agent's perceived influence on the world. By considering how to improve one's empowerment estimator, which we call empowerment gain (EG), we derive a novel exploration criterion that focuses directly on the desired goal: exploration in order to help the agent recognise its capability to interact with the world. We propose a new theoretical framework based on improving a parametrised estimation of empowerment and show how it integrates novelty, surprise and learning progress into a single formulation. Empirically, we validate our theoretical findings on simple but instructive grid world environments. We show that while such an agent is still novelty seeking, i.e. interested in exploring the whole state space, it focuses on exploration where its perceived influence is greater, avoiding areas of greater stochasticity or traps that limit its control.

Philip Becker-Ehmck, Maximilian Karl, Jan Peters, Patrick van der Smagt
[ Visit Poster at Spot A2 in Virtual World ]
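Empowerment itself, the channel capacity from actions to future states, has a simple closed form in the deterministic one-step case that makes the "traps limit control" intuition concrete: it reduces to the log of the number of distinct reachable states. The grid and transition functions below are toy assumptions for illustration, not the paper's estimator.

```python
import numpy as np

def one_step_empowerment(state, transition, actions):
    """For deterministic dynamics, one-step empowerment max_{p(a)} I(A; S')
    reduces to log |{T(s, a) : a}| — the log-count of distinct reachable states."""
    reachable = {transition(state, a) for a in actions}
    return float(np.log(len(reachable)))

moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]

# in the open, each action reaches a different cell
open_cell = one_step_empowerment((2, 2), lambda s, a: (s[0] + a[0], s[1] + a[1]), moves)
# in a trap, every action leads back to the same state
trap = one_step_empowerment((0, 0), lambda s, a: (0, 0), moves)
```

An EG-driven explorer would prioritise regions where this quantity (or its learned estimate) can still improve, rather than novelty alone.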

Reinforcement Learning agents require a distribution of environments for their policy to be trained on. The method or process of defining these environments directly impacts robustness and generalization of the learned agent policies. In single agent reinforcement learning, this problem is often solved by domain randomization, or randomizing the environment and tasks within the scope of the desired operating domain of the agent. The challenge here is to generate both structured and solvable environments that guide the agent's learning process. Most recently, works have sought to produce the environments under the Unsupervised Environment Design (UED) formulation. However, these methods lead to a proliferation of adversarial agents to train one agent for a single agent problem in a discretized task domain. In this work, we aim to automatically generate environments that are solvable and challenging for the continuous multi-agent setting. We base our solution on the Teacher-Student relationship with parameter sharing $\textit{Students}$ where we re-imagine the $\textit{Teacher}$ as an environment generator for UED. Our approach uses one environment generator agent ($\textit{Teacher}$) for any number of learning agents ($\textit{Students}$). We qualitatively and quantitatively demonstrate that, in terms of multi-agent ($\geq$ 8 agents) navigation and steering, $\textit{Students}$ trained by our approach outperform agents using heuristic search, as well as agents trained by domain randomization. Our code is available at Link Withheld.

Yiping Wang, Brandon Haworth
[ Visit Poster at Spot A3 in Virtual World ]

Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards. However, since rewards can be sparse and task-specific, we are interested in the problem of learning without rewards, where agents must discover useful behaviors in the absence of domain-specific incentives. Intrinsic motivation is a family of unsupervised RL techniques which develop general objectives for an RL agent to optimize that lead to better exploration or the discovery of skills. In this paper, we propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences. The policies each take turns controlling the agent. The Explore policy maximizes entropy, putting the agent into surprising or unfamiliar situations. Then, the Control policy takes over and seeks to recover from those situations by minimizing entropy. The game harnesses the power of multi-agent competition to drive the agent to seek out increasingly surprising parts of the environment while learning to gain mastery over them, leading to better exploration and the emergence of complex skills. Theoretically, we show that under certain assumptions, this game pushes the agent to fully explore the latent state space of stochastic, partially-observed environments, whereas prior techniques will not. Empirically, we demonstrate that even with no external rewards, Adversarial Surprise learns more complex behaviors, and explores more effectively than competitive baselines, outperforming intrinsic motivation methods based on active inference, novelty-seeking (Random Network Distillation (RND)), and multi-agent unsupervised RL (Asymmetric Self-Play (ASP)).

Arnaud Fickinger, Natasha Jaques, Samyak Parajuli, Michael Chang, Nicholas Rhinehart, Glen Berseth, Stuart Russell, Sergey Levine
[ Visit Poster at Spot A2 in Virtual World ]
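The two-player surprise game above can be sketched with a count-based visitation model: the Explore policy is rewarded for the agent's surprise, the Control policy for the negative of it. The smoothed estimator and toy counts are illustrative assumptions; the paper works with latent states of partially observed environments.

```python
import numpy as np
from collections import Counter

def surprise(state, visit_counts, total):
    """Surprise -log p̂(s) of a state under the agent's empirical visitation model."""
    p = (visit_counts[state] + 1) / (total + 1)   # add-one smoothing
    return float(-np.log(p))

counts = Counter({0: 50, 1: 5, 2: 1})             # visitation so far
total = sum(counts.values())

# Explore maximizes entropy: it is paid for landing in surprising states.
explore_r = surprise(2, counts, total)            # rare state → large reward
# Control minimizes entropy: it is paid for returning to familiar states.
control_r = -surprise(0, counts, total)           # familiar state → small penalty
```

Alternating the two policies in control of the same body drives the competition the abstract describes: Explore keeps finding new surprise, Control keeps mastering it.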

Prior successes in offline learning are highlighted by its adaptability to novel scenarios. One of the key reasons behind this aspect is conservatism, the act of underestimating an agent’s expected value estimates. Recent work, on the other hand, has noted that overconservatism often cripples learning of meaningful behaviors. To that end, the paper asks: when does overconservatism hurt offline learning? The proposed answer understands conservatism in light of the conjugate space and empirical instabilities. In the case of the former, agents implicitly aim at learning complex, high-entropy distributions. As for the latter, overconservatism arises as a consequence of provably inaccurate approximations. Based on theoretical evidence, we address overconservatism through the lens of dynamic control. A feedback controller tunes the learned value estimates by virtue of direct dynamics in the compact latent space. In an empirical study of aerial control tasks on the CF2X quadcopter, we validate our theoretical insights and demonstrate efficacious transfer of offline policies to novel scenarios.

Karush Suri, Florian Shkurti
[ Visit Poster at Spot A1 in Virtual World ]

A desirable property of autonomous agents is the ability to both solve long-horizon problems and generalize to unseen tasks. Recent advances in data-driven skill learning have shown that extracting behavioral priors from offline data can enable agents to solve challenging long-horizon tasks with reinforcement learning. However, generalization to tasks unseen during behavioral prior training remains an outstanding challenge. To this end, we present Few-shot Imitation with Skill Transition Models (FIST), an algorithm that extracts skills from offline data and utilizes them to generalize to unseen tasks given a few demonstrations at test-time. FIST learns an inverse skill dynamics model and utilizes a semi-parametric approach for imitation. We show that FIST is capable of generalizing to new tasks and substantially outperforms prior baselines in navigation experiments requiring traversing unseen parts of a large maze and 7-DoF robotic arm experiments requiring manipulating previously unseen objects in a kitchen.

kourosh hakhamaneshi, Ruihan Zhao, Albert Zhan, Pieter Abbeel, Michael Laskin
[ Visit Poster at Spot A1 in Virtual World ]
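The semi-parametric imitation step above can be illustrated as a nearest-neighbour lookup: given the current state, retrieve the closest demonstration state and reuse the skill code executed there. The state vectors and discrete skill codes below are hypothetical; FIST's learned skill space and distance metric are more involved.

```python
import numpy as np

def retrieve_skill(query_state, demo_states, demo_skills):
    """Semi-parametric lookup: return the skill code attached to the
    demonstration state nearest to the query state."""
    dists = np.linalg.norm(demo_states - query_state, axis=1)
    return demo_skills[int(dists.argmin())]

demo_states = np.array([[0., 0.], [5., 5.], [9., 1.]])  # states seen in few-shot demos
demo_skills = np.array([0, 1, 2])                        # hypothetical skill codes
skill = retrieve_skill(np.array([4.5, 5.2]), demo_states, demo_skills)
```

A skill-conditioned policy (trained on the offline data) would then execute the retrieved code, which is what lets a handful of demonstrations steer the agent toward an unseen task.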

Biological agents have meaningful interactions with their environment despite the absence of a reward signal. In such instances, the agent can learn preferred modes of behaviour that lead to predictable states - necessary for survival. In this paper, we pursue the notion that this learnt behaviour can be a consequence of reward-free preference learning that ensures an appropriate trade-off between exploration and preference satisfaction. For this, we introduce a model-based Bayesian agent equipped with a preference learning mechanism (pepper) using conjugate priors. These conjugate priors are used to augment the expected free energy planner for learning preferences over states (or outcomes) across time. Importantly, our approach enables the agent to learn preferences that encourage adaptive behaviour at test time. We illustrate this in the OpenAI Gym FrozenLake and the 3D mini-world environments -- with and without volatility. Given a constant environment, these agents learn confident (i.e., precise) preferences and act to satisfy them. Conversely, in a volatile setting, perpetual preference uncertainty maintains exploratory behaviour. Our experiments suggest that learnable (reward-free) preferences entail a trade-off between exploration and preference satisfaction. Pepper offers a straightforward framework suitable for designing adaptive agents when reward functions cannot be predefined as in real environments.

Noor Sajid, Panagiotis Tigas, Alexey Zakharov, Zafeirios Fountas, Karl Friston
[ Visit Poster at Spot A0 in Virtual World ]
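The conjugate-prior mechanism above has a standard concrete instance: a Dirichlet prior over categorical outcomes, where each observation simply increments a concentration parameter and the total concentration measures how precise (confident) the learned preferences are. The three-outcome toy world below is an assumption for illustration.

```python
import numpy as np

def update_preferences(alpha, outcome):
    """Conjugate Dirichlet–categorical update: observing an outcome
    increments its concentration parameter."""
    alpha = alpha.copy()
    alpha[outcome] += 1.0
    return alpha

def preference_precision(alpha):
    """Total Dirichlet concentration: higher → more confident preferences."""
    return float(alpha.sum())

alpha = np.ones(3)                      # flat prior over 3 outcomes
for o in [0, 0, 0, 1]:                  # a stable environment mostly yields outcome 0
    alpha = update_preferences(alpha, o)

preferred = alpha / alpha.sum()         # expected preference distribution
```

In a constant environment the concentration grows and preferences sharpen; under volatility, conflicting observations keep the distribution flat, preserving the exploratory behaviour the abstract reports.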

MuZero, a model-based reinforcement learning algorithm that uses a value equivalent dynamics model, achieved state-of-the-art performance in Chess, Shogi and the game of Go. In contrast to standard forward dynamics models that predict a full next state, value equivalent models are trained to predict a future value, thereby emphasizing value relevant information in the representations. While value equivalent models have shown strong empirical success, there is no research yet that visualizes and investigates what types of representations these models actually learn. Therefore, in this paper we visualize the latent representation of MuZero agents. We find that action trajectories may diverge between observation embeddings and internal state transition dynamics, which could lead to instability during planning. Based on this insight, we propose two regularization techniques to stabilize MuZero's performance. Additionally, we provide an open-source implementation of MuZero along with an interactive visualizer of learned representations, which may aid further investigation of value equivalent algorithms.

joery de Vries, Ken Voskuil, Thomas M Moerland, Aske Plaat
[ Visit Poster at Spot A0 in Virtual World ]

Modeling controllable aspects of the environment enables better prioritization of interventions and has become a popular exploration strategy in reinforcement learning methods. Despite repeatedly achieving state-of-the-art results, this approach has only been studied as a proxy for a reward-based task and has not yet been evaluated on its own. We show that solutions relying on action prediction fail to model important events. Humans, on the other hand, assign blame to their actions to decide what they controlled. Here we propose Controlled Effect Network (CEN), an unsupervised method based on counterfactual measures of blame. CEN is evaluated in a wide range of environments, showing that it can identify controlled effects better than popular models based on action prediction.

Oriol Corcoll, Raul Vicente
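The counterfactual notion of blame above can be sketched directly: the controlled part of an effect is what changes between the observed transition and a counterfactual transition in which the agent's action is replaced by a default (e.g., a no-op). The toy world, the no-op default, and the subtraction-based measure are illustrative assumptions, not CEN's learned model.

```python
import numpy as np

def controlled_effect(transition, state, action, default_action=0):
    """Counterfactual blame: the part of the observed effect that vanishes
    when the agent's action is swapped for a default action."""
    actual = transition(state, action)
    counterfactual = transition(state, default_action)
    return actual - counterfactual      # zero wherever the world changes anyway

# toy world: position moves with the action; a clock ticks regardless
def transition(state, action):
    pos, clock = state
    return np.array([pos + action, clock + 1])

effect = controlled_effect(transition, (0, 0), action=2)
# the agent is blamed for the position change but not for the clock tick
```

Action-prediction models, by contrast, would happily latch onto any feature correlated with the action, including uncontrollable ones like the clock.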