Workshop
Decision Awareness in Reinforcement Learning
Evgenii Nikishin · Pierluca D'Oro · Doina Precup · Andre Barreto · Amir-massoud Farahmand · Pierre-Luc Bacon

Fri Jul 22 06:00 AM -- 05:00 PM (PDT) @ Hall G

The goal of reinforcement learning (RL) is to maximize a reward signal by taking optimal decisions. An RL system typically contains several moving components, possibly including a policy, a value function, and a model of the environment. We refer to decision awareness as the notion that each of the components and their combination should be explicitly trained to help the agent improve the total amount of collected reward. To better understand decision awareness, consider as an example a model-based method. For environments with rich observations (e.g., pixel-based), the world model is complex and standard approaches would need a large number of samples and a high-capacity function approximator to learn a reasonable approximation of the dynamics. However, a decision-aware agent might recognize that modeling all the granular complexity of the environment is neither feasible nor necessary to learn an optimal policy and instead focus on modeling aspects that are important for decision making. Decision awareness goes beyond the model learning aspect. In actor-critic algorithms, a critic is trained to predict the expected return while later used to aid policy learning. Is return prediction an optimal strategy for critic learning? And, in general, what is the best way to learn each component of an RL system? Our workshop aims at answering these questions and articulating that decision awareness might be a key towards solving grand challenges in RL, including exploration and sample efficiency. The workshop is about decision-aware RL algorithms, their implications, and real-world applications; we focus on decision-aware objectives, end-to-end procedures, and meta-learning techniques for training and discovering components in modular RL systems, as well as theoretical or empirical analyses of the interaction among multiple modules used by RL algorithms.

Fri 6:00 a.m. - 5:00 p.m. Please visit the workshop website for the full program (Program)
Fri 6:00 a.m. - 6:20 a.m. Opening Remarks (Presentation)
Fri 6:20 a.m. - 7:00 a.m. Differentiable optimization for control and reinforcement learning (Invited Talk) Brandon Amos
Fri 7:00 a.m. - 7:30 a.m. Break
Fri 7:30 a.m. - 8:10 a.m. Discovering RL Algorithms (Invited Talk) Junhyuk Oh
Fri 8:10 a.m. - 9:00 a.m. Discovered Policy Optimisation. Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy. Adaptive Interest for Emphatic Reinforcement Learning (Contributed Talks)
Fri 9:00 a.m. - 10:40 a.m. Break
Fri 10:40 a.m. - 11:20 a.m. The Value Equivalence Principle for Model-Based RL (Invited Talk) Christopher Grimm
Fri 11:20 a.m. - 12:00 p.m. A Model-Based Reinforcement Learning Wishlist (Invited Talk) Erin Talvitie
Fri 12:00 p.m. - 12:30 p.m. Break
Fri 12:30 p.m. - 1:30 p.m. DARL Panel (Panel Discussion)
Fri 1:30 p.m. - 2:30 p.m. Poster Session (In-person only poster presentation)
Fri 2:30 p.m. - 3:10 p.m. Policy Gradient: Theory for Making Best Use of It (Invited Talk) Mengdi Wang
Fri 3:10 p.m. - 3:50 p.m. General-purpose meta learning (Invited Talk) Louis Kirsch
Fri 3:50 p.m. - 5:00 p.m. Closing Remarks & Poster Session (Presentation followed by an In-person only poster presentation)

- Effective Offline RL Needs Going Beyond Pessimism: Representations and Distributional Shift (Poster)  link » Standard off-policy reinforcement learning (RL) methods based on temporal difference (TD) learning generally fail to learn good policies when applied to static offline datasets. Conventionally, this is attributed to distribution shift, where the Bellman backup queries high-value out-of-distribution (OOD) actions for the next time step, which then leads to systematic overestimation.
However, this explanation is incomplete, as conservative offline RL methods that directly address overestimation still suffer from stability problems in practice. This suggests that although OOD actions may account for part of the challenge, the difficulties with TD learning in the offline setting are also deeply connected to other aspects such as the quality of representations of learned function approximators. In this work, we demonstrate that merely imposing pessimism is not sufficient for good performance, and demonstrate empirically that regularizing representations actually accounts for a large part of the improvement observed in modern offline RL methods. Building on this insight, we identify concrete metrics that enable effective diagnosis of the quality of the learned representation, and are able to adequately predict performance of the underlying method. Finally, we show that a simple approach for handling representations, without changing any other aspect of conservative offline RL algorithms, can lead to better performance in several offline RL problems. Link » Xinyang Geng · Kevin Li · Abhishek Gupta · Aviral Kumar · Sergey Levine 🔗 - Hyperbolically Discounted Advantage Estimation for Generalization in Reinforcement Learning (Poster)  link »    In reinforcement learning (RL), agents typically discount future rewards using an exponential scheme. However, studies have shown that humans and animals instead exhibit hyperbolic time-preferences and thus discount future rewards hyperbolically. In the quest for RL agents that generalize well to previously unseen scenarios, we study the effects of hyperbolic discounting on generalization tasks and present Hyperbolic Discounting for Generalization in Reinforcement Learning (HDGenRL). We propose a hyperbolic discounting-based advantage estimation method that makes the agent aware of and robust to the underlying uncertainty of survival and episode duration. 
On the challenging RL generalization benchmark Procgen, our proposed approach achieves up to 200% performance improvement over the PPO baseline that uses classical exponential discounting. We also incorporate hyperbolic discounting into another generalization-specific approach (APDAC), and the results indicate further improvement in APDAC's generalization ability. This denotes the effectiveness of our approach as a plug-in to any existing methods in aiding generalization. Link » Nasik Muhammad Nafi · Raja Farrukh Ali · William Hsu 🔗 - Deep Policy Generators (Poster)  link » Traditional Reinforcement Learning (RL) learns policies that maximize expected return. Here we study neural nets (NNs) that learn to generate policies in the form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form "generate a policy that achieves a desired expected return," our NN generators combine powerful exploration of parameter space with greedy command choices to iteratively find better and better policies. A form of weight-sharing HyperNetworks and policy embeddings scales our method to generate deep NNs. Experiments show how a single learned policy generator can produce policies that achieve any return seen during training. Finally, we evaluate our algorithm on a set of continuous control tasks where it exhibits competitive performance. Link » Francesco Faccio · Vincent Herrmann · Aditya Ramesh · Louis Kirsch · Jürgen Schmidhuber 🔗 - CoMBiNED: Multi-Constrained Model Based Planning for Navigation in Dynamic Environments (Poster)  link »    Recent model based planning approaches have attained huge success on Atari games. However, learning accurate models for complex robotics scenarios such as navigation directly from high dimensional sensory measurements requires a huge amount of data and training. 
Furthermore, even a small change in robot configuration, such as kino-dynamics or sensors, at inference time requires re-training of the policy. In this paper, we address these issues in a principled fashion through a multi-constraint model based online planning (CoMBiNED) framework that does not require any retraining or modifications of the existing policy. We disentangle the given task into sub-tasks and learn dynamical models for them. Treating these dynamical models as soft constraints, we employ stochastic optimisation, via the cross-entropy method, to jointly optimize these sub-tasks on-the-fly. We consider navigation as the central application in this work and evaluate our approach on a publicly available benchmark with complex dynamic scenarios, achieving significant improvement over recent approaches both with and without a given map of the environment. Link » Harit Pandya · Rudra Poudel · Stephan Liwicki 🔗 - Exploration Hurts in Bandits with Partially Observed Stochastic Contexts (Poster)  link » Contextual bandits are widely-used models in reinforcement learning for incorporating both common and idiosyncratic factors in reward functions. The existing approaches rely on full observation of the stochastic context vectors, while the problem of learning optimal arms from partially observed contexts remains immature. We theoretically show that in the latter setting, decisions can be made more guarded to minimize the risk of pulling sub-optimal arms. More precisely, efficiency is established for Greedy policies that treat the estimates of the unknown parameter and of the unobserved contexts as their true values. That includes non-asymptotic worst-case regret bounds that grow (poly-)logarithmically with the time horizon and failure probability, and linearly with the number of arms. Numerical results that showcase the efficacy of avoiding exploration are provided. 
Link » Hongju Park · Mohamad Kazem Shirani Faradonbeh 🔗 - Exploration in Reward Machines with Low Regret (Poster)  link »    We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploit the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, either in the episodic or average-reward settings. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve a near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice. Link » Hippolyte Bourel · Anders Jonsson · Odalric-Ambrym Maillard · Mohammad Sadegh Talebi 🔗 - Exploring Long-Horizon Reasoning with Deep RL in Combinatorially Hard Tasks (Poster)  link » Deep reinforcement learning has shown promise in discrete domains requiring complex reasoning, including games such as Chess, Go, and Hanabi. However, this type of reasoning is less often observed in long-horizon, continuous domains with high-dimensional observations, where instead RL research has predominantly focused on problems with simple high-level structure (e.g. opening a drawer or moving a robot as fast as possible). 
Inspired by combinatorially hard optimization problems, we propose a set of robotics tasks which admit many distinct solutions at the high-level, but require reasoning about states and rewards thousands of steps into the future for the best performance. Critically, while RL has traditionally suffered on complex, long-horizon tasks due to sparse rewards, our tasks are carefully designed to be solvable without specialized exploration. Nevertheless, our investigation finds that standard RL methods often neglect long-term effects due to discounting, while general-purpose hierarchical RL approaches struggle unless additional abstract domain knowledge can be exploited. Link » Andrew C Li · Pashootan Vaezipoor · Rodrigo A Toro Icarte · Sheila McIlraith 🔗 - VIPer: Iterative Value-Aware Model Learning on the Value Improvement Path (Poster)  link »    We propose a practical and generalizable Decision-Aware Model-Based Reinforcement Learning algorithm. We extend the frameworks of VAML (Farahmand et al., 2017) and IterVAML (Farahmand, 2018), which have been shown to be difficult to scale to high-dimensional and continuous environments (Lovatto et al., 2020a; Modhe et al., 2021; Voelcker et al., 2022). We propose to use the notion of the Value Improvement Path (Dabney et al., 2020) to improve the generalization of VAML-like model learning. We show theoretically for linear and tabular spaces that our proposed algorithm is sensible, justifying extension to non-linear and continuous spaces. We also present a detailed implementation proposal based on these ideas. Link » Romina Abachi · Claas Voelcker · Animesh Garg · Amir-massoud Farahmand 🔗 - Model-Based Meta Automatic Curriculum Learning (Poster)  link »    When an agent trains for one target task, its experience is expected to be useful for training on another target task. 
This paper formulates the meta curriculum learning problem that builds a sequence of intermediate training tasks, called a curriculum, which will assist the learner to train toward any given target task in general. We propose a model-based meta automatic curriculum learning algorithm (MM-ACL) that learns to predict the performance on one task when trained on another, given contextual information such as the history of training tasks, loss functions, rollout state-action trajectories from the policy, etc. This predictor facilitates the generation of a curriculum that optimizes the performance of the learner on different target tasks. Our empirical results demonstrate that MM-ACL outperforms a random curriculum, a manually created curriculum, and a commonly used non-stationary bandit algorithm in a GridWorld domain. Link » Zifan Xu · Yulin Zhang · Shahaf Shperberg · Reuth Mirsky · Yuqian Jiang · Bo Liu · Peter Stone 🔗 - Adaptive Interest for Emphatic Reinforcement Learning (Spotlight)  link »    Emphatic algorithms have shown great promise in stabilizing and improving reinforcement learning by selectively emphasizing the update rule. Although the emphasis fundamentally depends on an interest function which defines the intrinsic importance of each state, most approaches simply adopt a uniform interest over all states (except where a hand-designed interest is possible based on domain knowledge). In this paper, we investigate adaptive methods that allow the interest function to dynamically vary over states and iterations. In particular, we leverage meta-gradients to automatically discover online an interest function that would accelerate the agent’s learning process. Empirical evaluations on a wide range of environments show that adapting the interest is key to providing significant gains. 
Qualitative analysis indicates that the learned interest function emphasizes states of particular importance, such as bottlenecks, which can be especially useful in a transfer learning setting. Link » Martin Klissarov · Rasool Fakoor · Jonas Mueller · Kavosh Asadi · Taesup Kim · Alex Smola 🔗 - General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States (Poster)  link » Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single value function for evaluating (and thus helping to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of 'probing states' and a mapping from actions produced in probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in the form of very few states sufficient to fully specify the behavior of many policies. A policy improves solely by changing actions in probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in Swimmer-v3 and Hopper-v3 environments only by knowing how to act in 3 and 5 such learned states, respectively. Remarkably, our value function trained to evaluate NN policies is also invariant to changes of the policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training. 
Link » Francesco Faccio · Aditya Ramesh · Vincent Herrmann · Jean Harb · Jürgen Schmidhuber 🔗 - An Investigation into the Open World Survival Game Crafter (Poster)  link »    We share our experience with the recently released Crafter benchmark, a 2D open world survival game. Crafter allows tractable investigation of novel agents and their generalization, exploration and long-term reasoning capabilities. We evaluate agents on the original Crafter environment, as well as on a newly introduced set of generalization environments, suitable for evaluating agents' robustness to unseen objects and fast-adaptation (meta-learning) capabilities. Through several experiments we provide a couple of critical insights that are of general interest for future work on Crafter. We find that: (1) Simple agents with tuned hyper-parameters outperform all previous agents. (2) Feedforward agents can unlock almost all achievements by relying on the inventory display. (3) Recurrent agents improve on feedforward ones, also without the inventory information. (4) All agents (including interpretable object-centric ones) fail to generalize to OOD objects. We will open-source our code. Link » Aleksandar Stanic · Yujin Tang · David Ha · Jürgen Schmidhuber 🔗 - Unsupervised Model-based Pre-training for Data-efficient Reinforcement Learning from Pixels (Poster)  link »    Reinforcement learning aims at autonomously performing complex tasks. To this end, a reward signal is used to steer the learning process. While successful in many circumstances, the approach is typically data hungry, requiring large amounts of task-specific interaction between agent and environment to learn efficient behaviors. To alleviate this, unsupervised reinforcement learning proposes to collect data through self-supervised interaction to accelerate task-specific adaptation. 
However, whether current unsupervised strategies lead to improved generalization capabilities is still unclear, more so when the input observations are high-dimensional. In this work, we advance the field by closing the performance gap in the Unsupervised Reinforcement Learning Benchmark, a collection of tasks to be solved in a data-efficient manner, after interacting with the environment in a self-supervised way. Our model-based approach combines exploration and planning to efficiently fine-tune unsupervised pre-trained models, achieving comparable results to task-specific baselines. We extensively evaluate our work, comparing several exploration methods and improving fine-tuning by studying the interaction between the model components. Furthermore, we investigate the limits of the learned model and the unsupervised methods to gain insights into how these influence the decision process, shedding light on new research directions. Link » Sai Rajeswar · Pietro Mazzaglia · Tim Verbelen · Alex Piche · Bart Dhoedt · Aaron Courville · Alexandre Lacoste 🔗 - Model-Based Reinforcement Learning with SINDy (Poster)  link »    We draw on the latest advancements in the physics community to propose a novel method for discovering the governing non-linear dynamics of physical systems in reinforcement learning (RL). We establish that this method is capable of discovering the underlying dynamics using significantly fewer trajectories (as little as one rollout with $\leq 30$ time steps) than state of the art model learning algorithms. Further, the technique learns a model that is accurate enough to induce near-optimal policies given significantly fewer trajectories than those required by model-free algorithms. It brings the benefits of model-based RL without requiring a model to be developed in advance, for systems that have physics-based dynamics. To establish the validity and applicability of this algorithm, we conduct experiments on four classic control tasks. 
We found that an optimal policy trained on the discovered dynamics of the underlying system can generalize well. Further, the learned policy performs well when deployed on the actual physical system, thus bridging the model to real system gap. We further compare our method to state-of-the-art model-based and model-free approaches, and show that our method requires fewer trajectories sampled on the true physical system compared to other methods. Additionally, we explored approximate dynamics models and found that they also can perform well. Link » Rushiv Arora · Eliot Moss · Bruno da Silva 🔗 - Toward Human Cognition-inspired High-Level Decision Making For Hierarchical Reinforcement Learning Agents (Poster)  link »    The ability of humans to efficiently understand and learn to solve complex tasks with relatively limited data is attributed to our hierarchically organized decision-making process. Meanwhile, sample efficiency is a long-standing challenge for reinforcement learning (RL) agents, especially in long-horizon, sequential decision-making tasks with sparse and delayed rewards. Hierarchical reinforcement learning (HRL) augments RL agents with temporal abstraction to improve their efficiency in such complex tasks. However, the decision-making process of most HRL methods is often based directly on dense low-level information, while also using fixed temporal abstraction. We propose the hierarchical world model (HWM), which is geared toward capturing more flexible high-level, temporally abstract dynamics, as well as low-level dynamics of the task. Preliminary experiments on using the HWM with model-based RL resulted in improved sample efficiency and final performance. An investigation of the state representations learned by the HWM also shows their alignment with human intuition and understanding. Finally, we provide a theoretical foundation for integrating the proposed HWM with the HRL framework, thus building toward RL agents with hierarchically structured decision-making which 
aligns with the theorized principles of human cognition and decision process. Link » Rousslan F. J. Dossa · Takashi Matsubara 🔗 - MoCoDA: Model-based Counterfactual Data Augmentation (Poster)  link »    The number of states in a dynamic process is exponential in the number of objects, making reinforcement learning (RL) difficult in complex, multi-object domains. For agents to scale to the real world, they will need to react to and reason about unseen combinations of objects. We argue that the ability to recognize and use local factorization in transition dynamics is a key element in unlocking the power of multi-object reasoning. To this end, we show that (1) known local structure in the environment transitions is sufficient for an exponential reduction in the sample complexity of training a dynamics model, and (2) a locally factored dynamics model provably generalizes out-of-distribution to unseen states and actions. Knowing the local structure also allows us to predict which unseen states and actions this dynamics model will generalize to. We propose to leverage these observations in a novel Model-based Counterfactual Data Augmentation (MoCoDA) framework. MoCoDA applies a learned locally factored dynamics model to an augmented distribution of states and actions to generate counterfactual transitions for RL. MoCoDA works with a broader set of local structures than prior work and allows for direct control over the augmented training distribution. We show that MoCoDA enables RL agents to learn policies that generalize to unseen states and actions. We use MoCoDA to train an offline RL agent to solve an out-of-distribution robotics manipulation task on which standard offline RL algorithms fail. 
Link » Silviu Pitis · Elliot Creager · Ajay Mandlekar · Animesh Garg 🔗 - An Adaptive Entropy-Regularization Framework for Multi-Agent Reinforcement Learning (Poster)  link »    In this paper, we propose an adaptive entropy-regularization framework (ADER) for multi-agent reinforcement learning (RL) to learn the adequate amount of exploration for each agent based on the degree of required exploration. In order to handle instability arising from updating multiple entropy temperature parameters for multiple agents, we disentangle the soft value function into two types: one for pure reward and the other for entropy. By applying multi-agent value factorization to the disentangled value function of pure reward, we obtain a relevant metric to assess the necessary degree of exploration for each agent. Based on this metric, we propose the ADER algorithm based on maximum entropy RL, which controls the necessary level of exploration across agents over time by learning the proper target entropy for each agent. Experimental results show that the proposed scheme significantly outperforms current state-of-the-art multi-agent RL algorithms. Link » WOOJUN KIM · Youngchul Sung 🔗 - Leader-based Decision Learning for Cooperative Multi-Agent Reinforcement Learning (Poster)  link » A leader in the team enables efficient learning for other novices in the social learning setting for both humans and animals. This paper constructs the leader-based decision learning framework for Multi-Agent Reinforcement Learning and investigates whether the leader enables the learning of novices as well. We compare three different approaches to distilling a leader's experiences: Linear Layer Dimension Reduction, Attentive Graph Pooling, and Attention-based Graph Neural Network. 
We successfully show that leader-based decision learning can 1) enable agents to learn faster, cooperate more effectively, and escape local optima, and 2) promote the generalizability of agents in more challenging and unseen environments. The key to effective distillation is to maintain and aggregate important information. Link » Wenqi Chen · Xin Zeng · Amber Li 🔗 - Recursive History Representations for Unsupervised Reinforcement Learning in Multiple-Environments (Poster)  link » In recent years, the area of Unsupervised Reinforcement Learning (URL) has gained particular relevance. In this setting, an agent is pre-trained in an environment with reward-free interactions, often through a maximum state entropy objective that drives the agent towards a uniform coverage of the state space. It has been shown that this pre-training phase leads to significant performance improvements in downstream tasks later given to the agent to solve. The multiple-environments version of this setting introduces the problem of controlling the performance trade-offs in the environment class and leads to the following question: Can we build Pareto optimal policies for multiple-environments URL? In this work, we answer this question by proposing a novel non-Markovian policy architecture to be trained with the maximum state entropy objective. This architecture showcases significant empirical advantages when compared to state-of-the-art Markovian agents. Link » Mirco Mutti · Pietro Maldini · Riccardo De Santi · Marcello Restelli 🔗 - Building a Subspace of Policies for Scalable Continual Learning (Poster)  link »    The ability to continuously acquire new knowledge and skills is crucial for autonomous agents. However, existing methods are typically based on either fixed-size models that cannot capture many diverse behaviors, or growing-size models that scale poorly with the number of tasks. 
In this paper, we introduce Continual Subspace of Policies (CSP), a method that iteratively learns a subspace of policies in the continual reinforcement learning setting where tasks are presented sequentially. The subspace's high expressivity allows our method to strike a good balance between stability (i.e. not forgetting prior tasks) and plasticity (i.e. learning new tasks), while the number of parameters grows sublinearly with the number of tasks. In addition, CSP displays good transfer, being able to quickly adapt to new tasks including combinations of previously seen ones without additional training. Finally, CSP outperforms state-of-the-art methods on a wide range of scenarios in two different domains. An interactive visualization of the subspace can be found at https://share.streamlit.io/continual-subspace/policies/main. Link » Jean-Baptiste Gaya · Thang Doan · Lucas Caccia · Laure Soulier · Ludovic Denoyer · Roberta Raileanu 🔗 - DASCO: Dual-Generator Adversarial Support Constrained Offline Reinforcement Learning (Poster)  link »    In offline RL, constraining the learned policy to remain close to the data is essential to prevent the policy from outputting out-of-distribution (OOD) actions with erroneously overestimated values. In principle, generative adversarial networks (GAN) can provide an elegant solution to do so, with the discriminator directly providing a probability that quantifies distributional shift. However, in practice, GAN-based offline RL methods have not outperformed alternative approaches, perhaps because the generator is trained to both fool the discriminator and maximize return - two objectives that are often at odds with each other. In this paper, we show that the issue of conflicting objectives can be resolved by training two generators: one that maximizes return, with the other capturing the "remainder" of the data distribution in the offline dataset, such that the mixture of the two is close to the behavior policy. 
We show that not only does having two generators enable an effective GAN-based offline RL method, but also approximates a support constraint, where the policy does not need to match the entire data distribution, but only the slice of the data that leads to high long term performance. We name our method DASCO, for Dual-Generator Adversarial Support Constrained Offline RL. On benchmark tasks that require learning from sub-optimal data, DASCO significantly outperforms prior methods that enforce distribution constraint. Link » Quan Vuong · Aviral Kumar · Sergey Levine · Yevgen Chebotar 🔗 - Representation Gap in Deep Reinforcement Learning (Poster)  link »    Deep reinforcement learning gives the promise that an agent learns a good policy from high-dimensional information, whereas representation learning removes irrelevant and redundant information and retains pertinent information. We consider the representation capacity of the action value function and theoretically reveal an inherent property: a representation gap with its target action value function. This representation gap is favorable. However, through illustrative experiments, we show that the representation of action value function grows similarly compared with its target value function, i.e. the undesirable inactivity of the representation gap (representation overlap). Representation overlap results in a loss of representation capacity, which further leads to sub-optimal learning performance. To activate the representation gap, we propose a simple but effective framework Policy Optimization from Preventing Representation Overlaps (POPRO), which regularizes the policy evaluation phase by differentiating the representation of the action value function from that of its target. We also provide the convergence rate guarantee of POPRO. We evaluate POPRO on gym continuous control suites. The empirical results show that POPRO using pixel inputs outperforms or parallels the sample-efficiency of methods that use state-based features. 
Link » Qiang He · Huangyuan Su · Jieyu Zhang · Xinwen Hou 🔗 - Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations (Poster)  link »    Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, to date, offline reinforcement learning from visual observations with continuous action spaces has been relatively under-explored, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish a competitive baseline. We rigorously evaluate these algorithms on both existing offline datasets and a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline RL problems, and open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics. Link » Cong Lu · Philip Ball · Tim G. J Rudner · Jack Parker-Holder · Michael A Osborne · Yee-Whye Teh 🔗 - Giving Feedback on Interactive Student Programs with Meta-Exploration (Poster)  link »    Creating interactive software, such as websites or games, is a particularly engaging way to learn computer science. However, teaching and giving feedback on such software is hard — standard approaches require instructors to hand grade student-implemented interactive programs. 
As a result, online platforms that serve millions, like Code.org, are unable to provide any feedback on assignments for implementing interactive programs, which critically hinders students’ ability to learn. Recent work proposes to train reinforcement learning agents to interact with a student’s program, aiming to explore states indicative of errors. However, this approach only provides binary feedback of whether a program is correct or not, while students require finer-grained feedback on the specific errors in their programs to understand their mistakes. In this work, we show that exploring to discover errors can be cast as a meta-exploration problem. This enables us to construct a principled objective for discovering errors and an algorithm for optimizing this objective, which provides fine-grained feedback. We evaluate our approach on a set of 700K real anonymized student programs from a Code.org interactive assignment. Our approach provides feedback with 94.3% accuracy, improving over existing approaches by over 17.7% and coming within 1.5% of human-level accuracy. Link » Evan Liu · Moritz Stephan · Allen Nie · Chris Piech · Emma Brunskill · Chelsea Finn 🔗 - When to Ask for Help: Proactive Interventions in Autonomous Reinforcement Learning (Poster)  link » A long-term goal of reinforcement learning is to design agents that can autonomously interact and learn in the world. A critical challenge to such autonomy is the presence of irreversible states which require external assistance to recover from, such as when a robot arm has pushed an object off of a table. While standard agents require constant monitoring to decide when to intervene, we aim to design proactive agents that can request human intervention only when needed. To this end, we propose an algorithm that can efficiently learn to detect and avoid states that are irreversible, and proactively ask for help in case the agent does enter them. 
On a suite of continuous control environments with unknown irreversible states, we find that our algorithm exhibits both better sample- and intervention-efficiency compared to existing methods. Link » Annie Xie · Fahim Tajwar · Archit Sharma · Chelsea Finn 🔗 - Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions (Poster)  link »    Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function (or other functions of interest, such as density ratios). While recent works on marginalized importance sampling (MIS) show that the former can enjoy provable guarantees under realizable function approximation, the latter is only known to be feasible under much stronger assumptions such as prohibitively expressive discriminators. In this work, we provide guarantees for off-policy function estimation under only realizability, by imposing proper regularization on the MIS objectives. Compared to commonly used regularization in MIS, our regularizer is much more flexible and can account for an arbitrary user-specified distribution, under which the learned function will be close to the ground truth. We provide exact characterization of the optimal dual solution that needs to be realized by the discriminator class, which determines the data-coverage assumption in the case of value-function learning. As another surprising observation, the regularizer can be altered to relax the data-coverage requirement, and completely eliminate it in the ideal case with strong side information. Link » Audrey Huang · Nan Jiang 🔗 - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees (Poster)  link »    Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy that best fits observed sequences of states and actions implemented by an expert. 
Many algorithms for IRL have an inherent nested structure: the inner loop finds the optimal policy given parametrized rewards while the outer loop updates the estimates towards optimizing a measure of fit. For high dimensional environments such nested-loop structure entails a significant computational burden. To reduce the computational burden of a nested loop, novel methods such as SQIL [1] and IQ-Learn [2] emphasize policy estimation at the expense of reward estimation accuracy. However, without accurate estimated rewards, it is not possible to do counterfactual analysis such as predicting the optimal policy under different environment dynamics and/or learning new tasks. In this paper we develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm provably converges to a stationary solution with a finite-time guarantee. If the reward is parameterized linearly, we show the identified solution corresponds to the solution of the maximum entropy IRL problem. Finally, by using robotics control problems in Mujoco and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks. Link » Siliang Zeng · Chenliang Li · Alfredo Garcia · Mingyi Hong 🔗 - You Can’t Count on Luck: Why Decision Transformers Fail in Stochastic Environments (Poster)  link »    Recently, methods such as Decision Transformer that reduce reinforcement learning to a prediction task and solve it via supervised learning (RvS) have become popular due to their simplicity, robustness to hyperparameters, and strong overall performance on offline RL tasks. 
However, simply conditioning a probabilistic model on a desired return and taking the predicted action can fail dramatically in stochastic environments since trajectories that result in a return may have only achieved that return due to luck. In this work, we describe the limitations of RvS approaches in stochastic environments and propose a solution. Rather than simply conditioning on returns, as is standard practice, our proposed method, ESPER, conditions on learned average returns which are independent from environment stochasticity. Doing so allows ESPER to achieve strong alignment between target return and expected performance in real environments. We demonstrate this in several challenging stochastic offline-RL tasks including the challenging puzzle game 2048, and Connect Four playing against a stochastic opponent. In all tested domains, ESPER achieves significantly better alignment between the target return and achieved return than simply conditioning on returns. ESPER also achieves higher maximum performance than even the value-based baselines. Link » Keiran Paster · Sheila McIlraith · Jimmy Ba 🔗 - Convergence and Price of Anarchy Guarantees of the Softmax Policy Gradient in Markov Potential Games (Poster)  link »    We study the performance of policy gradient methods for the subclass of Markov games known as Markov potential games (MPGs), which extends the notion of normal-form potential games to the stateful setting and includes the important special case of the fully cooperative setting where the agents share an identical reward function. Our focus in this paper is to study the convergence of the policy gradient method for solving MPGs under softmax policy parameterization, both tabular and parameterized with general function approximators such as neural networks. We first show the asymptotic convergence of this method to a Nash equilibrium of MPGs for tabular softmax policies. 
Second, we derive the finite-time performance of the policy gradient in two settings: 1) using the log-barrier regularization, and 2) using the natural policy gradient under the best-response dynamics (NPG-BR). Finally, extending the notion of price of anarchy (POA) and smoothness in normal-form games, we introduce the POA for MPGs and provide a POA bound for NPG-BR. To our knowledge, this is the first POA bound for solving MPGs. To support our theoretical results, we empirically compare the convergence rates and POA of policy gradient variants for both tabular and neural softmax policies. Link » Dingyang Chen · Qi Zhang · Thinh Doan 🔗 - Fast Convergence for Unstable Reinforcement Learning Problems by Logarithmic Mapping (Poster)  link »    For many of the reinforcement learning applications, the system is assumed to be inherently stable and with bounded reward, state and action space. These are key requirements for the optimization convergence of classical reinforcement learning reward function with discount factors. Unfortunately, these assumptions do not hold true for many real world problems such as an unstable linear–quadratic regulator (LQR). In this work, we propose new methods to stabilize and speed up the convergence of unstable reinforcement learning problems with the policy gradient methods. We provide theoretical insights on the efficiency of our methods. In practice, we achieve good experimental results over multiple examples where the vanilla methods mostly fail to converge due to system instability. Link » WANG ZHANG · Lam Nguyen · Subhro Das · Alexandre Megretsky · Luca Daniel · Tsui-Wei Weng 🔗 - Self-Referential Meta Learning (Poster)  link »    Meta Learning automates the search for learning algorithms. At the same time, it creates a dependency on human engineering on the meta-level, where meta learning algorithms need to be designed. 
In this paper, we investigate self-referential meta learning systems that modify themselves without the need for explicit meta optimization. We discuss the relationship of such systems to memory-based meta learning and show that self-referential neural networks require functionality to be reused in the form of parameter sharing. Finally, we propose Fitness Monotonic Execution (FME), a simple approach to avoid explicit meta optimization. A neural network self-modifies to solve bandit and classic control tasks, improves its self-modifications, and learns how to learn, purely by assigning more computational resources to better performing solutions. Link » Louis Kirsch · Jürgen Schmidhuber 🔗 - Distributionally Adaptive Meta Reinforcement Learning (Poster)  link »    Meta-reinforcement learning algorithms provide a data-driven way to acquire learning algorithms that quickly adapt to many tasks with varying rewards or dynamics functions. However, learned meta-policies are often effective only on the exact task distribution on which the policy was trained, and struggle in the presence of distribution shift of test-time rewards or transition dynamics. In this work, we develop a framework for meta-RL algorithms that are able to behave appropriately under test-time distribution shifts in the space of tasks. Our framework centers on an adaptive approach to distributional robustness, in which we train a population of meta-agents to be robust to varying levels of distribution shift, so that when evaluated on a (potentially shifted) test-time distribution of tasks, we can adaptively choose the most appropriate meta-agent to follow. We formally show how this framework allows for improved regret under distribution shift, and empirically show its efficacy on simulated robotics problems under a wide range of distribution shifts. 
Link » Anurag Ajay · Dibya Ghosh · Sergey Levine · Pulkit Agrawal · Abhishek Gupta 🔗 - You Only Live Once: Single-Life Reinforcement Learning via Learned Reward Shaping (Poster)  link » Reinforcement learning algorithms are typically designed to learn a performant policy that can repeatedly and autonomously complete a task, typically starting from scratch. However, many real-world situations operate under a different set of assumptions: the goal might not be to learn a policy that can do the task repeatedly, but simply to perform a new task successfully once, ideally as quickly as possible, and while leveraging some prior knowledge or experience. For example, imagine a robot that is exploring another planet, where it cannot get help or supervision from humans. If it needs to navigate to a crater that it has never seen before in search of water, it does not really need to acquire a policy for reaching craters reliably; it only needs to reach this particular crater once. It must do so without the benefit of episodic resets and tackle a new, unknown terrain, but it can leverage prior experience it acquired on Earth. We formalize this problem setting, which we call single-life reinforcement learning (SLRL), where an agent must complete a task once while contending with some form of novelty in a single trial without interventions, given some prior data. In this setting, we find that algorithms designed for standard episodic reinforcement learning can struggle, as they have trouble recovering from novel states especially when informative rewards are not provided. Motivated by this observation, we also propose an algorithm, Q-weighted adversarial learning (QWALE), that addresses the dearth of supervision by employing a distribution matching strategy that leverages the agent's prior experience as guidance in novel situations. 
Our experiments on several single-life continuous control problems indicate that methods based on our distribution matching formulation are 20-60% more successful because they can more quickly recover from novel, out-of-distribution states. Link » Annie Chen · Archit Sharma · Sergey Levine · Chelsea Finn 🔗 - Discovered Policy Optimisation (Spotlight)  link »    The last decade has been revolutionary for reinforcement learning (RL): it can now solve complex decision and control problems. Successful RL methods were handcrafted using mathematical derivations, intuition, and experimentation. This approach has a major shortcoming: it results in specific solutions to the RL problem, rather than a protocol for discovering efficient and robust methods. In contrast, the emerging field of meta-learning provides a toolkit for automatic machine learning method optimisation, potentially addressing this flaw. However, black-box approaches which attempt to discover RL algorithms with minimal prior structure have thus far not been successful. Mirror Learning, which includes RL algorithms such as PPO, offers a potential framework. In this paper we explore the Mirror Learning space by meta-learning a “drift” function. We refer to the result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings. Link » Christopher Lu · Jakub Grudzien Kuba · Alistair Letcher · Luke Metz · Christian Schroeder · Jakob Foerster 🔗 - Directed Exploration via Uncertainty-Aware Critics (Poster)  link » The exploration-exploitation dilemma is still an open problem in Reinforcement Learning (RL), especially when coupled with deep architectures and in the context of continuous action spaces. 
Uncertainty quantification has been extensively used as a means to achieve efficient directed exploration. However, state-of-the-art methods for continuous actions still suffer from high sample complexity requirements. Indeed, they either completely lack strategies for propagating the epistemic uncertainty throughout the updates, or they mix it with aleatory uncertainty while learning the full return distribution (e.g., distributional RL). In this paper, we propose Wasserstein Actor-Critic (WAC), an actor-critic architecture inspired by the recent Wasserstein Q-Learning (WQL) (Metelli et al., 2019), that employs approximate Q-posteriors to represent the epistemic uncertainty and Wasserstein barycenters for uncertainty propagation across the state-action space. WAC enforces exploration in a principled way by guiding the policy learning process with the optimization of an upper bound of the Q-value estimates. Furthermore, we study some peculiar issues that arise when using function approximation, coupled with the uncertainty estimation, and propose a regularized loss for the uncertainty estimation. Finally, we evaluate our algorithm on a suite of continuous-action domains, where exploration is crucial, in comparison with state-of-the-art baselines. Our experiments show a clear benefit of using uncertainty-aware critics for continuous-action control. Link » Amarildo Likmeta · Matteo Sacco · Alberto Maria Metelli · Marcello Restelli 🔗 - Adversarial Cheap Talk (Poster)  link » Adversarial attacks in reinforcement learning (RL) often assume highly-privileged access to the learning agent’s parameters, environment or data. Instead, this paper proposes a novel adversarial setting called a Cheap Talk MDP in which an Adversary has a minimal range of influence over the Victim. Parameterised as a deterministic policy that only conditions on the current state, an Adversary can merely append information to a Victim’s observation. 
To motivate the minimum-viability, we prove that in this setting the Adversary cannot occlude the ground truth, influence the underlying dynamics of the environment, introduce non-stationarity, add stochasticity, see the Victim’s actions, or access their parameters. Additionally, we present a novel meta-learning algorithm to train the Adversary, called adversarial cheap talk (ACT). Using ACT, we demonstrate that the resulting Adversary still manages to influence the Victim’s training and test performance despite these restrictive assumptions. Affecting train-time performance reveals a new attack vector and provides insight into the success and failure modes of existing RL algorithms. More specifically, we show that an ACT Adversary is capable of harming performance by interfering with the learner’s function approximation and helping the Victim’s performance by appending useful features. Finally, we demonstrate that an ACT Adversary can append information during train-time to directly and arbitrarily control the Victim at test-time in a zero-shot manner. Link » Christopher Lu · Timon Willi · Alistair Letcher · Jakob Foerster 🔗 - Adaptive Intrinsic Motivation with Decision Awareness (Poster)  link »    Intrinsic motivation is a simple but powerful method to encourage exploration, which is one of the fundamental challenges of reinforcement learning. However, we demonstrate that widely used intrinsic motivation methods are highly dependent on the ratio between the extrinsic and intrinsic rewards through extensive experiments on sparse reward MiniGrid tasks. To overcome the problem, we propose an intrinsic reward coefficient adaptation scheme that is equipped with intrinsic motivation awareness and adjusts the intrinsic reward coefficient online to maximize the extrinsic return. 
We demonstrate that our method, named Adaptive Intrinsic Motivation with Decision Awareness (AIMDA), operates stably in various challenging MiniGrid environments without algorithm-task-specific hyperparameter tuning. Link » Suyoung Lee · Sae-Young Chung 🔗 - Leveraging Factored Action Spaces for Efficient Offline Reinforcement Learning in Healthcare (Poster)  link »    Many reinforcement learning (RL) applications have combinatorial action spaces, where each action is a composition of sub-actions. A standard RL approach ignores this inherent factorization structure, resulting in a potential failure to make meaningful inferences about rarely observed sub-action combinations; this is particularly problematic for offline settings, where data may be limited. In this work, we propose a form of linear Q-function decomposition induced by factored action spaces. We study the theoretical properties of our approach, identifying scenarios where it is guaranteed to lead to zero bias when used to approximate the Q-function. Outside the regimes with theoretical guarantees, we show that our approach can still be useful because it leads to better sample efficiency without necessarily sacrificing policy optimality, allowing us to achieve a better bias-variance trade-off. Across several offline RL problems using simulators and real-world datasets motivated by healthcare problems, we demonstrate that incorporating factored action spaces into value-based RL can result in better-performing policies. Our approach can help an agent make more accurate inferences within under-explored regions of the state-action space when applying RL to observational datasets. 
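The linear Q-function decomposition described in the abstract above can be illustrated with a minimal sketch. It assumes a tabular setting in which Q(s, a) is the sum of one component per sub-action; the function names, table layout, and sizes are hypothetical choices for illustration and are not taken from the authors' implementation.

```python
import itertools
import random


def make_factored_q(num_states, sub_action_sizes, seed=0):
    """Build one hypothetical table per sub-action dimension: tables[d][s][a_d]."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(size)] for _ in range(num_states)]
        for size in sub_action_sizes
    ]


def q_value(tables, state, action):
    """Linear decomposition: Q(s, a) is the sum of per-sub-action components."""
    return sum(table[state][a_d] for table, a_d in zip(tables, action))


def greedy_action(tables, state):
    """The argmax of a sum of independent terms decomposes per dimension."""
    return tuple(
        max(range(len(table[state])), key=lambda a_d: table[state][a_d])
        for table in tables
    )


def brute_force_action(tables, state):
    """Reference argmax over the full combinatorial action space."""
    sizes = [len(table[state]) for table in tables]
    return max(
        itertools.product(*(range(size) for size in sizes)),
        key=lambda action: q_value(tables, state, action),
    )


# Illustrative sizes (assumed): 3 states, sub-actions with 2, 3, and 4 choices.
tables = make_factored_q(num_states=3, sub_action_sizes=[2, 3, 4])
num_factored_entries = 3 * (2 + 3 + 4)  # entries stored by the factored form
num_joint_entries = 3 * (2 * 3 * 4)     # entries a joint Q-table would need
```

In this sketch the factored form stores a number of entries that grows with the sum of the sub-action sizes rather than their product, which is one way to see why rarely observed sub-action combinations can still receive an estimate. Whether that estimate is unbiased depends on the conditions the paper analyzes, which this sketch does not check.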
Link » Shengpu Tang · Maggie Makar · Michael Sjoding · Finale Doshi-Velez · Jenna Wiens 🔗 - Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting (Poster)  link » Early stopping based on the validation set performance is a popular approach to find the right balance between under- and overfitting in the context of supervised learning. However, in reinforcement learning, even for supervised sub-problems such as world model learning, early stopping is not applicable as the dataset is continually evolving. As a solution, we propose a new general method that dynamically adjusts the update-to-data (UTD) ratio during training based on under- and overfitting detection on a small subset of the continuously collected experience not used for training. We apply our method to DreamerV2, a state-of-the-art model-based reinforcement learning algorithm, and evaluate it on the DeepMind Control Suite and the Atari 100k benchmark. The results demonstrate that one can better balance under- and overestimation by adjusting the UTD ratio with our approach compared to the default setting in DreamerV2 and that it is competitive with an extensive hyperparameter search which is not feasible for many applications. Our method eliminates the need to set the UTD hyperparameter by hand and even leads to higher robustness with regard to other learning-related hyperparameters, further reducing the amount of necessary tuning. Link » Nicolai Dorka · Tim Welschehold · Wolfram Burgard 🔗 - Task Factorization in Curriculum Learning (Poster)  link »    A common challenge for learning when applied to a complex "target" task is that learning that task all at once can be too difficult due to inefficient exploration given a sparse reward signal. Curriculum Learning addresses this challenge by sequencing training tasks for a learner to facilitate gradual learning. 
One of the crucial steps in finding a suitable curriculum learning approach is to understand the dimensions along which the domain can be factorized. In this paper, we identify different types of factorizations common in the literature of curriculum learning for reinforcement learning tasks: factorizations that involve the agent, the environment, or the mission. For each factorization category, we identify the relevant algorithms and techniques that leverage that factorization and present several case studies to showcase how leveraging an appropriate factorization can boost learning using a simple curriculum. Link » Reuth Mirsky · Shahaf Shperberg · Yulin Zhang · Zifan Xu · Yuqian Jiang · Jiaxun Cui · Peter Stone 🔗 - SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition (Poster)  link »    Though many reinforcement learning (RL) problems involve learning policies in settings with difficult-to-specify safety constraints and sparse rewards, current methods struggle to acquire successful and safe policies. Methods that extract useful policy primitives from offline datasets using generative modeling have recently shown promise at accelerating RL in these more complex settings. However, we discover that current primitive-learning methods may not be well-equipped for safe policy learning and may promote unsafe behavior due to their tendency to ignore data from undesirable behaviors. To improve the safety of offline skill learning algorithms, we propose SAFEty skill pRiors (SAFER), an algorithm that accelerates policy learning on complex control tasks under safety constraints. Through principled training on an offline dataset, SAFER learns to extract safe primitive skills. In the inference stage, policies trained with SAFER learn to compose safe skills into successful policies. 
We theoretically characterize why SAFER can enforce safe policy learning and demonstrate its effectiveness on several complex safety-critical robotic grasping tasks inspired by the game Operation, in which SAFER outperforms baseline methods in learning successful policies and enforcing safety. Link » Dylan Slack · Yinlam Chow · Bo Dai · Nevan Wichers 🔗 - Guided Exploration in Reinforcement Learning via Monte Carlo Critic Optimization (Poster)  link »    The class of deep deterministic off-policy algorithms is effectively applied to solve challenging continuous control problems. However, current approaches use random noise as a common exploration method that has several weaknesses, such as a need for manual adjustment on a given task and the absence of exploratory calibration during the training process. We address these challenges by proposing a novel guided exploration method that uses a differential directional controller to incorporate scalable exploratory action correction. An ensemble of Monte Carlo Critics that provides exploratory direction is presented as a controller. The proposed method improves the traditional exploration scheme by changing exploration dynamically. We then present a novel algorithm exploiting the proposed directional controller for both policy and critic modification. The presented algorithm outperforms modern algorithms across a variety of problems from the DMControl suite. Link » Igor Kuznetsov 🔗 - Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy (Spotlight)  link »    Model-based reinforcement learning (RL) achieves higher sample efficiency in practice than model-free RL by learning a dynamics model to generate samples for policy learning. Previous works learn a "global" dynamics model to fit the state-action visitation distribution for all historical policies. 
However, in this paper, we find that learning a global dynamics model does not necessarily benefit model prediction for the current policy since the policy in use is constantly evolving. The evolving policy during training will cause state-action visitation distribution shifts. We theoretically analyze how the distribution of historical policies affects the model learning and model rollouts. We then propose a novel model-based RL method, named Policy-adaptation Model-based Actor-Critic (PMAC), which learns a policy-adapted dynamics model based on a policy-adaptation mechanism. This mechanism dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy. Experiments on a range of continuous control environments in MuJoCo show that PMAC achieves state-of-the-art asymptotic performance and almost two times higher sample efficiency than prior model-based methods. Link » xiyao wang · Wichayaporn Wongkamjan · Furong Huang 🔗 - Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning (Poster)  link » The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment, whose underlying dynamics likely exceed the agent's capacity for representation. 
In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model that is simple enough to learn while only incurring bounded sub-optimality. To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model. We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem. Crucially, our regret bound can be expressed in one of two possible forms, providing a performance guarantee for finding either the simplest model that achieves a desired sub-optimality gap or, alternatively, the best model given a limit on agent capacity. Link » Dilip Arumugam · Benjamin Van Roy 🔗 - Generalization of Reinforcement Learning with Policy-Aware Adversarial Data Augmentation (Poster)  link »    The generalization gap in reinforcement learning (RL) has been a significant obstacle that prevents the RL agent from learning general skills and adapting to varying environments. Increasing the generalization capacity of the RL systems can significantly improve their performance on real-world working environments. In this work, we propose a novel policy-aware adversarial data augmentation method to augment the standard policy learning method with automatically generated trajectory data. Different from the commonly used observation transformation based data augmentations, our proposed method adversarially generates new trajectory data based on the policy gradient objective and aims to more effectively increase the RL agent’s generalization ability with the policy-aware data augmentation. 
Moreover, we deploy a mixup step to integrate the original and generated data to enhance the generalization capacity while mitigating the over-deviation of the adversarial data. We conduct experiments on a number of RL tasks to investigate the generalization performance of the proposed method by comparing it with the standard baselines and the state-of-the-art mixreg approach. The results show our method can generalize well with limited training diversity, and achieve state-of-the-art generalization test performance. Link » Hanping Zhang · Yuhong Guo 🔗 - MEPG: A Minimalist Ensemble Policy Gradient Framework for Deep Reinforcement Learning (Poster)  link »    During the training of a reinforcement learning (RL) agent, the distribution of training data is non-stationary as the agent's behavior changes over time. Therefore, there is a risk that the agent is overspecialized to a particular distribution and its performance suffers in the larger picture. Ensemble RL can mitigate this issue by learning a robust policy. However, it suffers from heavy computational resource consumption due to the newly introduced value and policy functions. In this paper, to avoid the notorious resource consumption issue, we design a novel and simple ensemble deep RL framework that integrates multiple models into a single model. Specifically, we propose the Minimalist Ensemble Policy Gradient framework (MEPG), which introduces a minimalist ensemble consistent Bellman update by utilizing a modified dropout operator. MEPG retains the ensemble property by keeping the dropout consistent on both sides of the Bellman equation. Additionally, the dropout operator also increases MEPG's generalization capability. Moreover, we theoretically show that the policy evaluation phase in the MEPG maintains two synchronized deep Gaussian Processes. 
To verify the MEPG framework's ability to generalize, we perform experiments on the gym simulator, which show that the MEPG framework outperforms or achieves a similar level of performance as the current state-of-the-art ensemble methods and model-free methods without additional computational resource costs. Link » Qiang He · Huangyuan Su · Chen GONG · Xinwen Hou 🔗