
 
Workshop
Reinforcement Learning for Real Life
Yuxi Li · Minmin Chen · Omer Gottesman · Lihong Li · Zongqing Lu · Rupam Mahmood · Niranjani Prasad · Zhiwei (Tony) Qin · Csaba Szepesvari · Matthew Taylor

Fri Jul 23 06:00 AM -- 10:00 PM (PDT)
Event URL: https://sites.google.com/view/RL4RealLife

Reinforcement learning (RL) is a general learning, predicting, and decision-making paradigm that applies broadly across many disciplines, including science, engineering, and the humanities. RL has seen prominent successes in many problems, such as games, robotics, and recommender systems. However, applying RL in the real world remains challenging, and a natural question is:

Why isn’t RL used even more often and how can we improve this?

The main goals of the workshop are to: (1) identify key research problems that are critical for the success of real-world applications; (2) report progress on addressing these critical issues; and (3) have practitioners share their success stories of applying RL to real-world problems, and the insights gained from such applications.

We invite paper submissions successfully applying RL algorithms to real-life problems and/or addressing practically relevant RL issues. Our topics of interest are general, including (but not limited to): 1) practical RL algorithms, which covers all algorithmic challenges of RL, especially those that directly address challenges faced by real-world applications; 2) practical issues: generalization, sample efficiency, exploration, reward, scalability, model-based learning, prior knowledge, safety, accountability, interpretability, reproducibility, hyper-parameter tuning; and 3) applications.

We have 6 premier panel discussions and 70+ great papers/posters. Welcome!

Fri 6:00 a.m. - 8:00 a.m.
Poster Session

Poster rooms:
https://eventhosts.gather.town/FRMWWpHa7SmXcSfJ/rl4reallife-9-11-1
https://eventhosts.gather.town/YiR8LgF0UI99lNI0/rl4reallife-9-11-2

Parent room: https://eventhosts.gather.town/Fr6NslMRGcIIBTGx/rl4reallife-900-1100

Fri 8:00 a.m. - 9:00 a.m.
Panel Discussion

Panelists: Matthew Botvinick (DeepMind), Thomas Dietterich (Oregon State U.), Leslie Pack Kaelbling (MIT), John Langford (Microsoft; Moderator), Warren Powell (Princeton & Optimal Dynamics)

Co-Chairs: Csaba Szepesvari (DeepMind & U. of Alberta), Lihong Li (Amazon), and Yuxi Li (Attain.ai)

Matthew Botvinick, Thomas Dietterich, Leslie Kaelbling, John Langford, Warren B Powell, Csaba Szepesvari, Lihong Li, Yuxi Li
Fri 9:00 a.m. - 10:00 a.m.
Panel Discussion

Panelists: Ofra Amir (Technion), Finale Doshi-Velez (Harvard), Alan Fern (Oregon State), Zachary Lipton (CMU)

Co-Chairs/Moderators: Omer Gottesman (Brown U.) and Niranjani Prasad (Microsoft)

Ofra Amir, Finale Doshi-Velez, Alan Fern, Zachary Lipton, Omer Gottesman, Niranjani Prasad
Fri 10:00 a.m. - 11:00 a.m.
Panel Discussion

Panelists: George Konidaris (Brown), Jan Peters (TU Darmstadt), Martin Riedmiller (Deepmind), Angela Schoellig (U. of Toronto), Rose Yu (UCSD)

Chair/Moderator: Rupam Mahmood (U. of Alberta)

George Konidaris, Jan Peters, Martin Riedmiller, Angela Schoellig, Rose Yu, Rupam Mahmood
Fri 11:00 a.m. - 3:00 p.m.
Break
Fri 3:00 p.m. - 4:00 p.m.
Panel Discussion

Panelists: Alekh Agarwal (Microsoft), Ed Chi (Google), Maria Dimakopoulou (Netflix), Georgios Theocharous (Adobe)

Co-Chairs/Moderators: Minmin Chen (Google) and Lihong Li (Amazon)

Alekh Agarwal, Ed Chi, Maria Dimakopoulou, Georgios Theocharous, Minmin Chen, Lihong Li
Fri 4:00 p.m. - 5:00 p.m.
Spotlight   
Zhiwei (Tony) Qin, Xianyuan Zhan, Meng Qi, Ruihan Yang, Philip Ball, Hamsa Bastani, Yao Liu, Xiuwen Wang, Haoran Xu, Tony Z. Zhao, Lili Chen, Aviral Kumar
Fri 5:00 p.m. - 6:00 p.m.
Panel Discussion

Panelists: Craig Buhr (MathWorks), Jeff Mendenhall (Microsoft), Xiaocheng Tang (Didi), Yang Yu (Polixir.ai / Nanjing U.)

Co-Chairs/Moderators: Matthew E. Taylor (U. of Alberta) and Kathryn Hume (Borealis AI)

Craig Buhr, Jeff Mendenhall, Yang Yu, Matthew Taylor
Fri 7:00 p.m. - 8:00 p.m.
Panel Discussion

Panelists: Jim Dai (Cornell/CUHK), Fei Fang (CMU), Shie Mannor (Technion & Nvidia Research), Yuandong Tian (Facebook AI Research)

Co-Chairs: Zhiwei (Tony) Qin (Didi) and Zongqing Lu (PKU) (Moderator)

Jim Dai, Fei Fang, Shie Mannor, Yuandong Tian, Zhiwei (Tony) Qin, Zongqing Lu
Fri 8:00 p.m. - 10:00 p.m.
Poster Session

Poster rooms:
https://eventhosts.gather.town/PHQYRlB1BHUwBhO5/rl4reallife-23-1-1
https://eventhosts.gather.town/mQyaLuUsQkOBGuQm/rl4reallife-23-1-2

Parent room: https://eventhosts.gather.town/Qel2j6xBjlzx7zCJ/rl4reallife-2300-100

Fri 10:00 p.m. - 10:00 p.m.
Workshop ends
-
[ Visit Poster at Spot A2 in Virtual World ]

Optimizing the combustion efficiency of a thermal power generating unit (TPGU) is a highly challenging and critical task in the energy industry. We develop a new data-driven AI system, namely DeepThermal, to optimize the combustion control strategy for TPGUs. At its core is a new model-based offline reinforcement learning (RL) framework, called MORE, which leverages historical operational data of a TPGU to solve a highly complex constrained Markov decision process problem via purely offline training. In DeepThermal, we first learn a data-driven combustion process simulator from the offline dataset. The RL agent of MORE is then trained by combining real historical data as well as carefully filtered and processed simulation data through a novel restrictive exploration scheme. DeepThermal has been successfully deployed in four large coal-fired thermal power plants in China. Real-world experiments show that DeepThermal effectively improves the combustion efficiency of TPGUs. We also report the superior performance of MORE by comparing it with state-of-the-art algorithms on standard offline RL benchmarks.

Xianyuan Zhan, Haoran Xu, Yue Zhang, Xiangyu Zhu, Honglei Yin
-
[ Visit Poster at Spot B4 in Virtual World ]

In modern video encoders, rate control is a critical component and has been heavily engineered. It decides how many bits to spend to encode each frame, in order to optimize the rate-distortion trade-off over all video frames. This is a challenging constrained planning problem because of the complex dependency among decisions for different video frames and the bitrate constraint defined at the end of the episode.

We formulate the rate control problem as a Partially Observable Markov Decision Process (POMDP), and apply imitation learning to learn a neural rate control policy. We demonstrate that by learning from optimal video encoding trajectories obtained through evolution strategies, our learned policy achieves better encoding efficiency and has minimal constraint violation. In addition to imitating the optimal actions, we find that additional auxiliary losses, data augmentation/refinement and inference-time policy improvements are critical for learning a good rate control policy. We evaluate the learned policy against the rate control policy in libvpx, a widely adopted open source VP9 codec library, in the two-pass variable bitrate (VBR) mode. We show that over a diverse set of real-world videos, our learned policy achieves 8.5% median bitrate reduction without sacrificing video quality.

Hongzi Mao, Chenjie Gu, Miaosen Wang, Angie Chen, Nevena Lazic, Nir Levine, Derek Pang, Rene Claus, Marisabel Hechtman, Ching-Han Chiang, Cheng Chen, Jingning Han
-
[ Visit Poster at Spot C1 in Virtual World ]

Mixed integer programming (MIP) is a general optimization technique with various real-world applications. Finding feasible solutions for MIP problems is critical because many successful heuristics rely on a known initial feasible solution. However, the problem is NP-hard in general. In this work, we propose a deep reinforcement learning (DRL) model that efficiently finds a feasible solution for a general type of MIP. In particular, we develop a smart feasibility pump (SFP) method empowered by DRL, inspired by the feasibility pump (FP), a popular heuristic for searching for feasible MIP solutions. Numerical experiments on various problem instances show that SFP significantly outperforms the classic FP in terms of the number of steps required to reach the first feasible solution. We consider two different structures for the policy network: the classic multilayer perceptron (MLP) and a novel convolutional neural network (CNN) structure. The CNN captures the hidden information of the constraint matrix of the MIP problem and relieves the burden of calculating the projection of the current solution as the input at each time step. This highlights the representational power of the CNN structure.

Mengxin Wang, Meng Qi, Zuo-Jun Shen
-
[ Visit Poster at Spot B5 in Virtual World ]

Reliant on too many experiments to learn good actions, current reinforcement learning (RL) algorithms have limited applicability in real-world settings, where exploration can be too expensive to allow. We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data. This leads to particularly severe extrapolation error when our candidate policies diverge from the one that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy-constraint to reduce this divergence and a value-constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.

Rasool Fakoor, Jonas Mueller, Kavosh Asadi, Pratik Chaudhari, Alex Smola
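The two penalties above can be made concrete with a small sketch. The function below is illustrative only (hypothetical names, not the authors' implementation): it scores a candidate policy by its expected Q-value, minus a KL policy-constraint toward the behavior policy and a simple value-penalty against optimistic estimates.

```python
import numpy as np

def penalized_objective(q_values, pi_probs, beta_probs,
                        policy_coef=1.0, value_coef=1.0):
    """Toy batch-RL actor objective with two penalties.

    q_values:   estimated Q(s, a) for each action
    pi_probs:   candidate policy's action probabilities
    beta_probs: behavior (data-generating) policy's action probabilities
    """
    # Expected value under the candidate policy.
    expected_q = float(np.sum(pi_probs * q_values))
    # Policy constraint: KL(pi || beta) discourages divergence from the data.
    kl = float(np.sum(pi_probs * np.log(pi_probs / beta_probs)))
    # Value constraint: penalize overly optimistic estimates via the spread.
    optimism = float(np.max(q_values) - np.mean(q_values))
    return expected_q - policy_coef * kl - value_coef * optimism

pi = np.array([0.25, 0.25, 0.5])    # candidate policy
beta = np.array([0.4, 0.4, 0.2])    # behavior policy
q = np.array([1.0, 0.5, 2.0])       # uncertain Q estimates
score = penalized_objective(q, pi, beta)
```

Both penalties vanish when the candidate policy matches the behavior policy and the value estimates are flat, so the objective then reduces to the plain expected Q-value.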
-
[ Visit Poster at Spot C0 in Virtual World ]

The influence maximization (IM) problem aims at finding a subset of seed nodes in a social network that maximizes the spread of influence. In this study, we focus on a sub-class of IM problems where it is uncertain whether nodes are willing to be seeds when invited, called \textit{contingency-aware IM}. Such contingency-aware IM is critical for applications of non-profit organizations in low-resource communities (e.g., spreading awareness of disease prevention). Despite the initial success, a major practical obstacle in promoting the solutions to more communities is the tremendous runtime of the greedy algorithms and the lack of high performance computing (HPC) for the non-profits in the field -- whenever there is a new social network, the non-profits usually do not have the HPC resources to recalculate the solutions. Motivated by this, and inspired by the line of work that uses reinforcement learning (RL) to address combinatorial optimization on graphs, we formalize the problem as a Markov decision process (MDP) and use RL to learn an IM policy over historically seen networks that generalizes to unseen networks with negligible runtime at test time. To fully exploit the properties of our targeted problem, we propose two technical innovations that improve the existing methods: state abstraction and theoretically grounded reward shaping. Empirical results show that our method achieves influence as high as the state-of-the-art methods for contingency-aware IM, while having negligible runtime at test time.

Haipeng Chen, Wei Qiu, Han-Ching Ou, Bo An, Milind Tambe
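For context, the greedy baseline whose runtime motivates this work repeatedly simulates influence spread to pick the next seed. A minimal sketch under the standard independent-cascade model (the toy graph and propagation probability are assumptions for illustration):

```python
import random

def independent_cascade(graph, seeds, p=0.3, rng=None):
    """Simulate one independent-cascade spread; return # of activated nodes."""
    rng = rng or random
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_im(graph, k, sims=200, seed=0):
    """Classic greedy seed selection via Monte Carlo spread estimates."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        best, best_spread = None, -1.0
        for node in graph:
            if node in chosen:
                continue
            spread = sum(independent_cascade(graph, chosen + [node], rng=rng)
                         for _ in range(sims)) / sims
            if spread > best_spread:
                best, best_spread = node, spread
        chosen.append(best)
    return chosen

# Tiny toy network: node 0 is a hub, so greedy should pick it first.
g = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
seeds = greedy_im(g, k=1)
```

The nested loop over candidate seeds and Monte Carlo simulations is what makes this baseline expensive on large networks, which is the runtime bottleneck the RL policy is meant to avoid at test time.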
-
[ Visit Poster at Spot C6 in Virtual World ]

Combinatorial optimization problems (COPs) on graphs with real-life applications are canonical challenges in computer science. The difficulty of finding quality labels for problem instances holds back leveraging supervised learning across combinatorial problems. Reinforcement learning (RL) algorithms have recently been adopted to address this challenge automatically. The underlying principle of this approach is to deploy a graph neural network (GNN) that encodes both the local information of the nodes and the graph-structured data in order to capture the current state of the environment. An actor then learns problem-specific heuristics on its own and makes an informed decision at each state to finally reach a good solution. Recent studies on this subject mainly focus on a family of combinatorial problems on graphs, such as the traveling salesman problem, where the proposed model aims to find an ordering of vertices that optimizes a given objective function. We use security-aware phone clone allocation in the cloud as a classical quadratic assignment problem (QAP) to investigate whether deep RL-based models are generally applicable to other classes of such hard problems. Extensive empirical evaluation shows that existing RL-based models may not generalize to QAP.

Mostafa Pashazadeh, Kui Wu
-
[ Visit Poster at Spot B1 in Virtual World ]

In real-world recommender systems, we often aim to optimize \textit{ranking} decision making. In these applications, \textit{off-policy evaluation} (OPE) is beneficial because it enables performance estimation of unknown ranking policies using only logged data. However, naive application of OPE to ranking policies faces a critical variance issue. To tackle the issue, we often introduce user behavior assumptions to make the combinatorial item space tractable. However, a strong assumption may in turn cause serious bias in the performance estimation. Therefore, it is important to appropriately control the bias-variance tradeoff by imposing a reasonable assumption. To achieve this, we propose a \textit{doubly robust} (DR) estimator for ranking policies that works under the \textit{cascade} assumption. Since the cascade assumption posits that a user interacts with items sequentially from the top position to the bottom, it is more reasonable than assuming that a user interacts with items independently. The proposed estimator leads to unbiased estimation in more cases than the existing estimator built on the independence assumption. Furthermore, compared to the previous estimator built on the same cascade assumption, DR reduces the variance under a reasonable assumption. Finally, experiments show that the proposed estimator works favorably in various synthetic settings.

Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto
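As background, the single-action doubly robust estimator that this work extends to ranked lists combines a reward model with an importance-weighted correction; a minimal sketch (function and variable names are hypothetical):

```python
import numpy as np

def dr_estimate(rewards, behavior_probs, target_probs, q_hat_logged, v_hat):
    """Doubly robust off-policy value estimate (single-action logging).

    rewards:        observed rewards of the logged actions
    behavior_probs: logging policy's probability of each logged action
    target_probs:   evaluation policy's probability of each logged action
    q_hat_logged:   reward-model prediction for each logged (context, action)
    v_hat:          reward-model value of the evaluation policy per context
    """
    w = target_probs / behavior_probs  # importance weights
    # Model baseline plus importance-weighted correction of the model's error.
    return float(np.mean(v_hat + w * (rewards - q_hat_logged)))

rewards = np.array([1.0, 0.0, 1.0])
b = np.array([0.5, 0.5, 0.25])   # behavior-policy propensities
t = np.array([0.9, 0.1, 0.5])    # target-policy propensities
q_hat = np.array([0.8, 0.2, 0.6])
v_hat = np.array([0.7, 0.7, 0.5])
estimate = dr_estimate(rewards, b, t, q_hat, v_hat)
```

When the reward model is perfect the correction term vanishes, and when the model is zero the estimator reduces to plain inverse-propensity scoring; the paper's contribution is carrying this interpolation over to ranked item lists under the cascade assumption.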
-
[ Visit Poster at Spot A6 in Virtual World ]

Success stories of applied machine learning can be traced back to the datasets and environments that were put forward as challenges for the community. The challenge that the community sets as a benchmark is usually the challenge that the community eventually solves. The ultimate challenge of reinforcement learning research is to train real agents to operate in the real environment, but there is no common real-world benchmark to track the progress of RL on physical robotic systems. To address this issue, we have created a physical RL benchmark -- a collection of real-world environments for reinforcement learning in robotics with free public remote access. In this work, we introduce four tasks in two environments and experimental results on one of them that demonstrate the feasibility of learning on a real robotic system. We train a mobile robot end-to-end to solve a visual navigation task relying solely on camera input and without access to location information. Close integration into the existing ecosystem allows the community to start using the proposed system without any prior experience in robotics and takes away the burden of managing a physical robotics system, abstracting it under a familiar API.

Ashish Kumar, Toby Buckley, John Lanier, Qiaozhi Wang, Alicia Kavelaars, Ilya Kuzovkin
-
[ Visit Poster at Spot C6 in Virtual World ]

Training-time safety violations have been a major concern when we deploy reinforcement learning algorithms in the real world. This paper explores the possibility of safe RL algorithms with zero training-time safety violations in the challenging setting where we are only given a safe but trivial-reward initial policy, without any prior knowledge of the dynamics and without additional offline data. We propose an algorithm, Co-trained Barrier Certificate for Safe RL (CRABS), which iteratively learns barrier certificates, dynamics models, and policies. The barrier certificates are learned via adversarial training and ensure the policy's safety assuming calibrated learned dynamics. We also add a regularization term to encourage larger certified regions to enable better exploration. Empirical simulations show that zero safety violations are already challenging for a suite of simple environments with only 2-4 dimensional state spaces, especially if high-reward policies have to visit regions near the safety boundary. Prior methods require hundreds of violations to achieve decent rewards on these tasks, whereas our proposed algorithm incurs zero violations.

Yuping Luo, Tengyu Ma
-
[ Visit Poster at Spot C5 in Virtual World ]

The use of Reinforcement Learning (RL) agents in practical applications requires the consideration of suboptimal outcomes, depending on the familiarity of the agent with its environment. This is especially important in safety-critical environments, where errors can lead to high costs or damage. In distributional RL, the risk-sensitivity can be controlled via different distortion measures of the estimated return distribution. However, these distortion functions require an estimate of the risk level, which is difficult to obtain and depends on the current state. In this work, we demonstrate the suboptimality of a static risk level estimation and propose a method to dynamically select risk levels at each environment step. Our method ARA (Automatic Risk Adaptation) estimates the appropriate risk level in both known and unknown environments using a Random Network Distillation error. We show reduced failure rates by up to a factor of 7 and improved generalization performance by up to 14% compared to both risk-aware and risk-agnostic agents in several locomotion environments.

Frederik Schubert, Theresa Eimer, Bodo Rosenhahn, Marius Lindauer
-
[ Visit Poster at Spot B0 in Virtual World ]

The control variates (CV) method is widely used in policy gradient estimation to reduce the variance of the gradient estimators in practice. A control variate is applied by subtracting a baseline function from the state-action value estimates. Then the variance-reduced policy gradient presumably leads to higher learning efficiency. Recent research on control variates with deep neural net policies mainly focuses on scalar-valued baseline functions. The effect of vector-valued baselines is under-explored. This paper investigates variance reduction with coordinate-wise and layer-wise control variates constructed from vector-valued baselines for neural net policies. We present experimental evidence suggesting that lower variance can be obtained with such baselines than with the conventional scalar-valued baseline. We demonstrate how to equip the popular Proximal Policy Optimization (PPO) algorithm with these new control variates. We show that the resulting algorithm with proper regularization can achieve higher sample efficiency than scalar control variates in continuous control benchmarks.

Yuanyi Zhong, Yuan Zhou, Jian Peng
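The variance-reduction effect of subtracting a scalar baseline, the starting point this paper generalizes to vector-valued baselines, can be checked numerically in a toy score-function-gradient setting (all quantities here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy score-function gradient samples: g_i = (G_i - b) * grad_log_pi_i,
# with returns G_i and per-sample score terms drawn independently.
returns = rng.normal(loc=5.0, scale=1.0, size=10_000)  # returns G_i
scores = rng.normal(size=10_000)                       # grad log pi terms

grad_no_baseline = returns * scores
baseline = returns.mean()                              # scalar value baseline
grad_with_baseline = (returns - baseline) * scores

var_raw = grad_no_baseline.var()
var_cv = grad_with_baseline.var()
```

In this toy setup returns and scores are independent with zero-mean scores, so subtracting the baseline leaves the gradient estimate unbiased while shrinking its variance roughly from E[G^2] down to Var(G); coordinate-wise and layer-wise baselines push this idea further by using a different baseline per gradient component.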
-
[ Visit Poster at Spot A2 in Virtual World ]

We address the problem of solving complex bimanual robot manipulation tasks on multiple objects with sparse rewards. Such complex tasks can be decomposed into sub-tasks that are accomplishable by different robots concurrently or sequentially for better efficiency. While previous reinforcement learning approaches primarily focus on modeling the compositionality of sub-tasks, two fundamental issues are largely ignored particularly when learning cooperative strategies for two robots: (i) domination, i.e., one robot may try to solve a task by itself and leaves the other idle; (ii) conflict, i.e., one robot can easily interrupt another's workspace when executing different sub-tasks simultaneously. To tackle these two issues, we propose a novel technique called disentangled attention, which provides an intrinsic regularization for two robots to focus on separate sub-tasks and objects. We evaluate our method on four bimanual manipulation tasks. Experimental results show that our proposed intrinsic regularization successfully avoids domination and reduces conflicts for the policies, which leads to significantly more effective cooperative strategies than all the baselines.

Minghao Zhang, Pingcheng Jian, Yi Wu, Harry (Huazhe) Xu, Xiaolong Wang
-
[ Visit Poster at Spot B0 in Virtual World ]

We propose to solve quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method for quadrupedal locomotion that leverages a Transformer-based model for fusing proprioceptive states and visual observations. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We show that our method obtains significant improvements over policies with only proprioceptive state inputs and that Transformer-based models further improve generalization across environments. Our project page with videos is at https://LocoTransformer.github.io/.

Ruihan Yang, Minghao Zhang, Nicklas Hansen, Harry (Huazhe) Xu, Xiaolong Wang
-
[ Visit Poster at Spot A2 in Virtual World ]

Autonomous mobility-on-demand (AMoD) systems represent a rapidly developing mode of transportation wherein travel requests are dynamically handled by a coordinated fleet of robotic, self-driving vehicles. Given a graph representation of the transportation network - one where, for example, nodes represent areas of the city, and edges the connectivity between them - we argue that the AMoD control problem is naturally cast as a node-wise decision-making problem. In this paper, we propose a deep reinforcement learning framework to control the rebalancing of AMoD systems through graph neural networks. Crucially, we demonstrate that graph neural networks enable reinforcement learning agents to recover behavior policies that are significantly more transferable, generalizable, and scalable than policies learned through other approaches. Empirically, we show how the learned policies exhibit promising zero-shot transfer capabilities when faced with critical portability tasks such as inter-city generalization, service area expansion, and adaptation to potentially complex urban topologies.

Daniele Gammelli, Kaidi Yang, James Harrison, Filipe Rodrigues, Francisco Pereira, Marco Pavone
-
[ Visit Poster at Spot B1 in Virtual World ]

We investigate how effective an attacker can be when it only learns from its victim's actions, without access to the victim's reward. In this work, we are motivated by the scenario where the attacker wants to disrupt real-world RL applications, such as autonomous vehicles or delivery drones, without knowing the victim's precise goals or reward function. We argue that one heuristic approach an attacker can use is to strategically maximize the entropy of the victim's policy. The policy is generally not obfuscated, which implies it may be extracted simply by passively observing the victim. We provide such a strategy in the form of a reward-free exploration algorithm that maximizes the attacker's entropy during the exploration phase, and it then maximizes the victim's empirical entropy during the planning phase. In our experiments, the victim agents are subverted through policy entropy maximization, implying an attacker might not need access to the victim’s reward to succeed. Hence, even if the victim's reward information is protected, reward-free attacks, based only on observing behavior, underscore the need to better understand policy obfuscation when preparing to deploy reinforcement learning in real world applications.

Ted Fujimoto, Timothy Doster, Adam Attarian, Jill Brandenberger, Nathan Hodas
-
[ Visit Poster at Spot A3 in Virtual World ]

Deep Reinforcement Learning (DRL) is considered a potential framework to improve many real-world autonomous systems; it has attracted the attention of multiple and diverse fields. Nevertheless, successful deployment in the real world is a test that most DRL models still need to pass. In this work we focus on this issue by reviewing and evaluating the research efforts of both domain-agnostic and domain-specific communities. On one hand, we offer a comprehensive summary of DRL challenges and the different proposals to mitigate them; this helps identify five gaps in domain-agnostic research. On the other hand, from the domain-specific perspective, we discuss different success stories and argue why other models might fail to be deployed. Finally, we discuss ways to move forward that account for both perspectives.

Juan Jose Garau Luis, Edward Crawley, Bruce Cameron
-
[ Visit Poster at Spot A0 in Virtual World ]
We study the adversarial robustness in offline reinforcement learning. Given a batch dataset consisting of tuples $(s, a, r, s')$, an adversary is allowed to arbitrarily modify $\epsilon$ fraction of the tuples. From the corrupted dataset the learner aims to robustly identify a near-optimal policy. We first show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable in linear MDP of dimension $d$, even if the adversary only corrupts the reward element in a tuple. This contrasts with dimension-free results in robust supervised learning and best-known lower-bound in the online RL setting with corruption. Next, we propose a robust variant of the Least-Square Value Iteration (LSVI) algorithm utilizing robust supervised learning oracles, that achieves near-matching performances in cases both with and without full data coverage. The algorithm requires the knowledge of $\epsilon$ to design the pessimism bonus in the no-coverage case. Surprisingly, in this case, the knowledge of $\epsilon$ is necessary, as we show that being adaptive to unknown $\epsilon$ is impossible. This again contrasts with recent results on corruption-robust online RL and implies that robust offline RL is a strictly harder problem.
Xuezhou Zhang, Yiding Chen, Jerry Zhu, Wen Sun
-
[ Visit Poster at Spot C2 in Virtual World ]

In the industrial interior design process, professional designers plan the furniture layout to achieve a satisfactory 3D design for sale. In this paper, we explore the interior graphic scenes design task as a Markov decision process (MDP) in 3D simulation, which is solved by deep reinforcement learning. The goal is to produce a proper furniture layout in the 3D simulation of the indoor graphic scenes. In particular, we first transform the 3D interior graphic scenes into two 2D simulation scenes. We then design the simulated environment and apply two reinforcement learning agents to learn the optimal 3D layout for the MDP formulation in a cooperative way. We conduct our experiments on a large-scale real-world interior layout dataset that contains industrial designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts compared with the state-of-the-art model.

Xinhan Di, Pengqian Yu
-
[ Visit Poster at Spot D3 in Virtual World ]

The real world is large and complex. It is filled with many objects besides those defined by a task and objects can move with their own interesting dynamics. How should an agent learn to represent state to support efficient learning and generalization in such an environment? In this work, we present a novel memory architecture, Perceptual Schemata, for learning and zero-shot generalization in environments that have many, potentially moving objects. Perceptual Schemata represents state using a combination of schema modules that each learn to attend to and maintain stateful representations of different subspaces of a spatio-temporal tensor describing the agent’s observations. We present empirical results that Perceptual Schemata enables a state representation that can maintain multiple objects observed in sequence with independent dynamics while an LSTM cannot. We additionally show that Perceptual Schemata can generalize more gracefully to larger environments with more distractor objects, while an LSTM quickly overfits to the training tasks.

Wilka Carvalho, Murray Shanahan
-
[ Visit Poster at Spot A3 in Virtual World ]

We develop a new continual meta-learning method to address challenges in sequential multi-task learning. In this setting, the agent's goal is to achieve high reward over any sequence of tasks quickly. Prior meta-reinforcement learning algorithms have demonstrated promising results in accelerating the acquisition of new tasks. However, they require access to all tasks during training. Beyond simply transferring past experience to new tasks, our goal is to devise continual reinforcement learning algorithms that learn to learn, using their experience on previous tasks to learn new tasks more quickly. We introduce a new method, continual meta-policy search (CoMPS), that removes this limitation by meta-training in an incremental fashion, over each task in a sequence, without revisiting prior tasks. CoMPS continuously repeats two subroutines: learning a new task using RL and using the experience from RL to perform completely offline meta-learning to prepare for subsequent task learning. We find that CoMPS outperforms prior continual learning and off-policy meta-reinforcement learning methods on several sequences of challenging continuous control tasks.

Glen Berseth, Zhiwei Zhang
-
[ Visit Poster at Spot A1 in Virtual World ]

Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models, leveraging the Markov property to factorize the problem in time. However, we can also view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether powerful, high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide simple and effective solutions to the RL problem. To this end, we explore how RL can be reframed as ``one big sequence modeling'' problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards. Addressing RL as a sequence modeling problem significantly simplifies a range of design decisions: we no longer require separate behavior policy constraints, as is common in prior work on offline model-free RL, and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL. All of these roles are filled by the same Transformer sequence model. In our experiments, we demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL.

Michael Janner, Qiyang Li, Sergey Levine
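A minimal sketch of the "one big sequence modeling" view: each transition's state dimensions, action dimensions, and reward are discretized and interleaved into a single token stream that a Transformer could model autoregressively. The uniform discretization below is illustrative, not the authors' exact scheme:

```python
import numpy as np

def discretize(x, low, high, bins=100):
    """Map continuous values to integer tokens in [0, bins)."""
    frac = (np.clip(x, low, high) - low) / (high - low)
    return np.minimum((frac * bins).astype(int), bins - 1)

def trajectory_to_tokens(states, actions, rewards, low=-1.0, high=1.0, bins=100):
    """Interleave discretized state dims, action dims, and reward per step."""
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(discretize(np.asarray(s), low, high, bins).tolist())
        tokens.extend(discretize(np.asarray(a), low, high, bins).tolist())
        tokens.append(int(discretize(np.array([r]), low, high, bins)[0]))
    return tokens

# Two steps of a toy 2-D-state, 1-D-action trajectory.
states = [[0.0, 0.5], [-0.5, 1.0]]
actions = [[0.2], [-0.2]]
rewards = [0.1, 0.9]
seq = trajectory_to_tokens(states, actions, rewards)
```

Once a trajectory is flattened this way, next-token prediction over the stream plays the role of dynamics model, policy, and reward model at once, which is what lets one sequence model replace the separate components listed in the abstract.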
-
[ Visit Poster at Spot B2 in Virtual World ]

We provide a unifying view of a large family of previous imitation learning algorithms through the lens of moment matching. At its core, our classification scheme is based on whether the learner attempts to match (1) reward or (2) action-value moments of the expert's behavior, with each option leading to differing algorithmic approaches. By considering adversarially chosen divergences between learner and expert behavior, we are able to derive bounds on policy performance that apply for all algorithms in each of these classes, the first to our knowledge. We also introduce the notion of moment recoverability, implicit in many previous analyses of imitation learning, which allows us to cleanly delineate how well each algorithmic family is able to mitigate compounding errors. We derive three novel algorithm templates (AdVIL, AdRIL, and DAeQuIL) with strong guarantees, simple implementation, and competitive empirical performance.

Gokul Swamy, Sanjiban Choudhury, J. Bagnell, Steven Wu
-
[ Visit Poster at Spot A0 in Virtual World ]

Path planning, the problem of efficiently discovering high-reward trajectories, often requires optimizing a high-dimensional and multimodal reward function. Popular approaches like CEM and CMA-ES greedily focus on promising regions of the search space and may get trapped in local maxima. DOO and VOOT balance exploration and exploitation, but use space partitioning strategies independent of the reward function to be optimized. Recently, LaMCTS empirically learns to partition the search space in a reward-sensitive manner for black-box optimization. In this paper, we develop a novel formal regret analysis for when and why such an adaptive region partitioning scheme works. We also propose a new path planning method PlaLaM which improves the function value estimation within each sub-region, and uses a latent representation of the search space. Empirically, PlaLaM outperforms existing path planning methods in 2D navigation tasks, especially in the presence of difficult-to-escape local optima, and shows benefits when plugged into model-based RL with planning components such as PETS. These gains transfer to highly multimodal real-world tasks, where we outperform strong baselines in compiler phase ordering by up to 245% and in molecular design by up to 0.4 on properties on a 0-1 scale. Code is available at https://github.com/yangkevin2/plalam.

Kevin Yang, Tianjun Zhang, Chris Cummins, Brandon Cui, Benoit Steiner, Linnan Wang, Joseph E Gonzalez, Dan Klein, Yuandong Tian
-
[ Visit Poster at Spot A6 in Virtual World ]

In this paper, we study how mobile manipulators can autonomously learn skills that require a combination of navigation and grasping. Learning robotic skills in the real world remains challenging without large scale data collection and supervision. These difficulties have often been sidestepped by limiting the robot to only manipulation or navigation, and by using human effort to provide demonstrations, task resets, and data labeling during the training process. Our aim is to devise a robotic reinforcement learning system for learning navigation and manipulation together, in a way that minimizes human intervention and enables continual learning under realistic assumptions. Specifically, our system, ReLMM, can learn continuously on a real-world platform without any environment instrumentation, with minimal human intervention, and without access to privileged information, such as maps, object positions, or a global view of the environment. Our method employs a modularized policy with components for manipulation and navigation, where uncertainty over the manipulation value function drives exploration for the navigation controller, and the success of the manipulation module provides rewards for navigation. We evaluate our method on a room cleanup task, where the robot must pick up each item of clutter from the floor. After a brief grasp pretraining phase with human oversight, ReLMM can learn navigation and grasping together fully automatically, in around 40 hours of real-world training with minimal human intervention.

Charles Sun, Jedrzej Orbik, Coline Devin, Abhishek Gupta, Glen Berseth, Sergey Levine
-
[ Visit Poster at Spot D6 in Virtual World ]

Learning data representations that are useful for various downstream tasks is a cornerstone of artificial intelligence. While existing methods are typically evaluated on downstream tasks such as classification or generative image quality, we propose to assess representations through their usefulness in downstream control tasks, such as reaching or pushing objects. By training over 10,000 reinforcement learning policies, we extensively evaluate to what extent different representation properties affect out-of-distribution (OOD) generalization. Finally, we demonstrate zero-shot transfer of these policies from simulation to the real world, without any domain randomization or fine-tuning. This paper aims to establish the first systematic characterization of the usefulness of learned representations for real-world OOD downstream tasks.

Frederik Träuble, Andrea Dittadi, Manuel Wüthrich, Felix Widmaier, Peter Gehler, Ole Winther, Francesco Locatello, Olivier Bachem, Bernhard Schölkopf, Stefan Bauer
-
[ Visit Poster at Spot D0 in Virtual World ]

We focus on reinforcement learning (RL) in relational problems that are naturally defined in terms of objects, their relations, and manipulations. These problems are characterized by variable state and action spaces, and finding a fixed-length representation, required by most existing RL methods, is difficult, if not impossible. We present a deep RL framework based on graph neural networks and auto-regressive policy decomposition that naturally works with these problems and is completely domain-independent. We demonstrate the framework in three very distinct domains and we report the method’s competitive performance and impressive zero-shot generalization over different problem sizes. In goal-oriented BlockWorld, we demonstrate multi-parameter actions with pre-conditions. In SysAdmin, we show how to select multiple objects simultaneously. In the classical planning domain of Sokoban, the method trained exclusively on 10×10 problems with three boxes solves 89% of 15×15 problems with five boxes.

Jaromír Janisch, Tomas Pevny, Viliam Lisy
-
[ Visit Poster at Spot D1 in Virtual World ]

We extend the framework of Classification with Costly Features (CwCF) that works with samples of fixed dimensions to trees of varying depth and breadth (similar to a JSON/XML file). In this setting, the sample is a tree - sets of sets of features. Individually for each sample, the task is to sequentially select informative features that help the classification. Each feature has a real-valued cost, and the objective is to maximize accuracy while minimizing the total cost. The process is modeled as an MDP where the states represent the acquired features, and the actions select unknown features. We present a specialized neural network architecture trained through deep reinforcement learning that naturally fits the data and directly selects features in the tree. We demonstrate our method in seven datasets and compare it to two baselines.

Jaromír Janisch, Tomas Pevny, Viliam Lisy
-
[ Visit Poster at Spot A1 in Virtual World ]

The use of options can greatly accelerate exploration in RL, especially when only sparse reward signals are available. While option discovery methods have been proposed for individual agents, in MARL settings, discovering collaborative options that can coordinate the behavior of multiple agents and encourage them to jointly visit under-explored regions of the state space has not been considered. In this paper, we propose a novel framework for multi-agent deep covering option discovery. Specifically, it first leverages an attention mechanism to find collaborative agent subgroups that would benefit most from coordination. Then, a hierarchical algorithm based on soft actor-critic, namely H-MSAC, is developed to learn the multi-agent options for each sub-group and then to integrate them through a high-level policy. This hierarchical option construction allows our framework to strike a balance between scalability and effective collaboration among the agents. The evaluation based on multi-agent collaborative tasks shows that the proposed algorithm can effectively capture agent interaction during learning and significantly outperforms prior works using single-agent options or no options, in terms of both faster exploration and higher task rewards.

Jiayu Chen, Marina W Haliem, Tian Lan, Vaneet Aggarwal
-
[ Visit Poster at Spot A4 in Virtual World ]
Efficient exploration is crucial for sample-efficient reinforcement learning. In this paper, we present a scalable exploration method called HyperDQN, which builds on the well-known Deep Q-Network (DQN) (Mnih et al., 2015) and extends the hypermodel idea (Dwaracherla et al., 2020) to deep reinforcement learning. In particular, HyperDQN maintains a probabilistic meta-model that captures the epistemic uncertainty of the Q-value function over the parameter space. This meta-model samples randomized Q-value functions, which generate exploratory action sequences for deep exploration. The proposed method requires fewer samples to achieve substantially better performance than DQN and Bootstrapped DQN (Osband et al., 2016) on hard-exploration tasks, including deep sea, grid world, and mountain car. The numerical results demonstrate that the developed approach leads to efficient exploration with limited computational resources.
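
A toy sketch of the randomized-value idea: a linear hypermodel maps a latent sample z to Q-function weights, and one sampled Q-function can then act greedily for a whole episode (the mechanism behind deep exploration). The dimensions and the linear form here are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, z_dim = 4, 3, 8

# Linear hypermodel: a latent index z is mapped to Q-network weights.
A = rng.normal(size=(z_dim, n_features * n_actions)) * 0.1
b = rng.normal(size=(n_features * n_actions,))

def sample_q_function():
    """Draw z ~ N(0, I) and return the induced Q-value function."""
    z = rng.normal(size=z_dim)
    theta = (z @ A + b).reshape(n_features, n_actions)
    return lambda phi: phi @ theta  # phi: state-feature vector

# Deep exploration: one sampled Q-function acts greedily for a whole episode.
q = sample_q_function()
phi = rng.normal(size=n_features)
action = int(np.argmax(q(phi)))
```

Resampling z per episode yields temporally consistent exploratory behavior, unlike per-step epsilon-greedy noise.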
Ziniu Li, Yingru Li, Hao Liang, Tong Zhang
-
[ Visit Poster at Spot D3 in Virtual World ]

Offline reinforcement learning enables agents to make use of large pre-collected datasets of environment transitions and learn control policies without the need for potentially expensive or unsafe online data collection. Recently, significant progress has been made in offline RL, with the dominant approach now being methods that leverage a learned dynamics model. This typically involves constructing a probabilistic model, and using it to penalize rewards in regions of high uncertainty, solving for a pessimistic MDP that lower bounds the true MDP. Recent work, however, exhibits a breakdown between theory and practice, whereby the pessimistic return ought to be bounded by the total variation distance of the model from the true dynamics, but is instead implemented through a penalty based on estimated model uncertainty. This has spawned a variety of uncertainty heuristics, with little to no comparison between differing approaches. In this paper, we show these heuristics have significant interactions with other design choices, such as the number of models in the ensemble, the model rollout length and the penalty weight. Furthermore, we compare these uncertainty heuristics under a new evaluation protocol that, for the first time, captures the specific covariate shift induced by model-based RL. This allows us to accurately assess the calibration of different proposed penalties. Finally, with these insights, we show that selecting these key hyperparameters using Bayesian Optimization produces drastically stronger performance than existing hand-tuned methods.
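
The uncertainty-penalty recipe discussed here can be sketched in a few lines: reward minus a multiple of ensemble disagreement. The max-over-dimensions aggregation and the penalty weight below are illustrative choices among the heuristics the paper compares:

```python
import numpy as np

def penalized_reward(reward, next_state_preds, lam=1.0):
    """Pessimistic reward: subtract a penalty proportional to ensemble
    disagreement (std of predicted next states across ensemble members)."""
    disagreement = np.std(next_state_preds, axis=0).max()  # max over state dims
    return float(reward - lam * disagreement)

# An ensemble of 3 models agrees on in-distribution inputs...
agree = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
r_in = penalized_reward(1.0, agree)
# ...and disagrees out-of-distribution, so the reward is penalized.
disagree = np.array([[1.0, 2.0], [0.0, 3.0], [2.0, 1.0]])
r_out = penalized_reward(1.0, disagree)
```

As the abstract notes, lam interacts strongly with ensemble size and rollout length, which motivates tuning them jointly.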

Cong Lu, Philip Ball, Jack Parker-Holder, Michael A Osborne, Stephen Roberts
-
[ Visit Poster at Spot A1 in Virtual World ]

Machine learning methods have proven to be effective tools for molecular design, allowing for efficient exploration of the vast chemical space via deep molecular generative models. Here, we propose a graph-based deep generative model for de novo molecular design using reinforcement learning. We demonstrate how the reinforcement learning framework can successfully fine-tune the generative model towards molecules with various desired sets of properties, even when few molecules have the goal attributes initially. We explored the following tasks: decreasing/increasing the size of generated molecules, increasing their drug-likeness, and increasing protein-binding activity. Using our model, we are able to generate 95% predicted active compounds for a common benchmarking task, outperforming previously reported methods on this metric.

Sara Romeo Atance, Ola Engkvist, Simon Olsson, Rocío Mercado
-
[ Visit Poster at Spot C5 in Virtual World ]

Monte Carlo Tree Search (MCTS) has shown its strength in many deterministic and stochastic settings, but the literature lacks reports of applications to real-world industrial processes. Common reasons are that no efficient simulator of the process is available, or that applying MCTS to the complex rules of the process is problematic. In this paper, we apply MCTS to optimize a high-precision manufacturing process that has stochastic and partially observable outcomes. We make use of an expert-knowledge-based simulator and adapt the MCTS default policy to deal with the manufacturing process.

Dorina Weichert, Alexander Kister
-
[ Visit Poster at Spot B0 in Virtual World ]
Efficient methods to evaluate new algorithms are critical for improving reinforcement learning systems such as ad recommendation systems. A/B tests are reliable, but are costly in time and money and entail a risk of failure. In this paper, we develop a new method of off-policy evaluation, predicting the performance of an algorithm given historical data generated by a different algorithm. Our estimator converges in probability to the true value of a counterfactual algorithm at a √N rate. We also show how to correctly estimate the variance of our estimator. In a special-case setting which covers contextual bandits, we show that our estimator achieves the lowest variance among a wide class of estimators. These properties hold even when the analyst does not know which among a large number of state variables are actually important, or when the baseline policy is unknown. We validate our method with a simulation experiment and on real-world data from a major advertisement company. We apply our method to improve an ad policy for the aforementioned company. We find that our method produces smaller mean squared errors than state-of-the-art methods.
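
For the contextual-bandit special case mentioned above, the classical inverse-propensity-score (IPS) estimator illustrates the core idea of predicting a counterfactual policy's value from logged data. The paper's estimator refines this; the synthetic setup below is purely for illustration:

```python
import numpy as np

def ips_estimate(rewards, propensities, target_probs):
    """Inverse-propensity-score estimate of a target policy's value from
    logged bandit data: mean of r * pi_target(a|x) / pi_logging(a|x)."""
    weights = target_probs / propensities
    return float(np.mean(weights * rewards))

# Logged data: logging policy chose each of 2 actions uniformly (p = 0.5).
rng = np.random.default_rng(1)
n = 10_000
actions = rng.integers(0, 2, size=n)
propensities = np.full(n, 0.5)
rewards = (actions == 1).astype(float)           # action 1 always pays 1
target_probs = np.where(actions == 1, 0.9, 0.1)  # target picks 1 w.p. 0.9

v_hat = ips_estimate(rewards, propensities, target_probs)
# The true value of the target policy is 0.9; v_hat should be close.
```

The reweighting removes any dependence on a learned value model, which is exactly what makes IPS-style estimators attractive when function approximation is untrusted.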
Richard Liu, Yusuke Narita, Kohei Yata
-
[ Visit Poster at Spot C4 in Virtual World ]

Solving complex real-world tasks, e.g., autonomous fleet control, often involves a coordinated team of multiple agents which learn strategies from visual inputs via reinforcement learning. However, many existing multi-agent reinforcement learning (MARL) algorithms don't scale to environments where agents operate on visual inputs. To address this issue, recent works have focused algorithmically on non-stationarity, exploration, or communication. In contrast, we study whether scalability can also be achieved via a disentangled representation. For this, we explicitly construct an object-centric intermediate representation to characterize the states of an environment, which we refer to as 'semantic tracklets.' We evaluate 'semantic tracklets' on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment. 'Semantic tracklets' consistently outperform baselines on VMPE, and achieve a +2.4 higher score difference than baselines on GFootball. Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.

Iou-Jen Liu, Zhongzheng Ren, Raymond Yeh, Alex Schwing
-
[ Visit Poster at Spot B2 in Virtual World ]

Reinforcement learning (RL) can be used to learn treatment policies and aid decision making in healthcare. However, given the need for generalization over complex state/action spaces, the incorporation of function approximators (e.g., deep neural networks) requires model selection to reduce overfitting and improve policy performance at deployment. Yet a standard validation pipeline for model selection requires running a learned policy in the actual environment, which is often infeasible in a healthcare setting. In this work, we investigate a model selection pipeline for offline RL that relies on off-policy evaluation (OPE) as a proxy for validation performance. We present an in-depth analysis of popular OPE methods, highlighting the additional hyperparameters and computational requirements (fitting/inference of auxiliary models) when used to rank a set of candidate policies. To compare the utility of different OPE methods as part of the model selection pipeline, we experiment with a clinical decision-making task of sepsis treatment. Among all the OPE methods, FQE is the most robust to different sampling conditions (with various sizes and data-generating behaviors) and consistently leads to the best validation ranking, but this comes with a high computational cost. To balance this trade-off between accuracy of ranking and computational efficiency, we propose a simple two-stage approach to accelerate model selection by avoiding potentially unnecessary computation. Our work represents an important first step towards enabling fairer comparisons in offline RL; it serves as a practical guide for offline RL model selection and can help RL practitioners in healthcare learn better policies on real-world datasets.

Shengpu Tang, Jenna Wiens
-
[ Visit Poster at Spot A0 in Virtual World ]

Offline Reinforcement Learning (RL) aims at learning an optimal control from a fixed dataset, without interactions with the system. An agent in this setting should avoid selecting actions whose consequences cannot be predicted from the data. This is the converse of exploration in RL, which favors such actions. We thus take inspiration from the literature on bonus-based exploration to design a new offline RL agent. The core idea is to subtract a prediction-based exploration bonus from the reward, instead of adding it for exploration. This allows the policy to stay close to the support of the dataset. We connect this approach to a more common regularization of the learned policy towards the data. Instantiated with a bonus based on the prediction error of a variational autoencoder, we show that our agent is competitive with the state of the art on a set of continuous control locomotion and manipulation tasks.
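
A schematic version of the anti-exploration idea: subtract a prediction-error bonus from the reward rather than adding it. Here a nearest-neighbor distance stands in for the paper's VAE reconstruction error; both penalize state-action pairs far from the dataset's support:

```python
import numpy as np

dataset_sa = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.2]])  # logged (s, a) pairs

def novelty_bonus(sa):
    """Stand-in for the VAE prediction error: distance of a state-action
    pair to its nearest neighbor in the offline dataset."""
    return float(np.min(np.linalg.norm(dataset_sa - sa, axis=1)))

def anti_exploration_reward(reward, sa, alpha=1.0):
    """Subtract the novelty bonus from the reward (instead of adding it,
    as online exploration methods do) to keep the policy in-support."""
    return reward - alpha * novelty_bonus(sa)

r_in = anti_exploration_reward(1.0, np.array([0.1, 0.0]))   # in-support pair
r_out = anti_exploration_reward(1.0, np.array([3.0, 3.0]))  # far from the data
```

Training any standard off-policy agent on the modified reward then implicitly regularizes the learned policy toward the data, as the abstract's connection to policy regularization suggests.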

Shideh Rezaeifar, Robert Dadashi, Nino Vieillard, Léonard Hussenot, Olivier Bachem, Olivier Pietquin, Matthieu Geist
-
[ Visit Poster at Spot A4 in Virtual World ]

A generalist robot equipped with learned skills must be able to perform many tasks in many different environments. However, zero-shot generalization to new settings is not always possible. When the robot encounters a new environment or object, it may need to finetune some of its previously learned skills to accommodate this change. But crucially, previously learned behaviors and models should still be suitable to accelerate this relearning. In this paper, we aim to study how generative models of possible outcomes can allow a robot to learn visual representations of affordances, so that the robot can sample potentially possible outcomes in new situations, and then further train its policy to achieve those outcomes. In effect, prior data is used to learn what kinds of outcomes may be possible, such that when the robot encounters an unfamiliar setting, it can sample potential outcomes from its model, attempt to reach them, and thereby update both its skills and its outcome model. This approach, visuomotor affordance learning (VAL), can be used to train goal-conditioned policies that operate on raw image inputs, and can rapidly learn to manipulate new objects via our proposed affordance-directed exploration scheme. We show that VAL can utilize prior data to solve real-world tasks such as drawer opening, grasping, and placing objects in new scenes with only five minutes of online experience in the new scene.

Khazatsky Alexander, Ashvin Nair
-
[ Visit Poster at Spot A1 in Virtual World ]

The model-free deep reinforcement learning framework is well adapted to the field of robotics, but it is difficult to deploy in the real world due to the poor sample efficiency of the learning process. In the widely used temporal-difference algorithms, this inefficiency is partly due to the noisy supervision caused by bootstrapping. The label noise is heteroscedastic: the target network prediction is subject to epistemic uncertainty which depends on the input and the learning process. We propose Inverse-Variance RL, which uses uncertainty predictions to weight the samples in the mini-batch during the Bellman update following the Batch Inverse-Variance approach for heteroscedastic regression in neural networks. We show experimentally that this approach improves the sample efficiency of DQN in two environments, and propose directions for further work on this method.
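
The weighting scheme can be sketched as an inverse-variance weighted TD loss over a mini-batch; the normalization and the variance floor eps below are illustrative choices:

```python
import numpy as np

def inverse_variance_loss(q_pred, td_target, target_var, eps=1e-3):
    """Inverse-variance weighted TD loss: samples whose bootstrapped targets
    carry high epistemic uncertainty contribute less to the update."""
    weights = 1.0 / (target_var + eps)
    weights = weights / weights.sum()  # normalize over the mini-batch
    return float(np.sum(weights * (q_pred - td_target) ** 2))

q_pred = np.array([1.0, 2.0, 3.0])
td_target = np.array([1.5, 2.0, 0.0])    # last target is far off...
target_var = np.array([0.1, 0.1, 10.0])  # ...but flagged as highly uncertain
loss = inverse_variance_loss(q_pred, td_target, target_var)
plain_mse = float(np.mean((q_pred - td_target) ** 2))  # unweighted baseline
```

Down-weighting the uncertain third sample shields the update from its noisy bootstrapped label, which is the mechanism behind the claimed sample-efficiency gains.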

Vincent Mai, Kaustubh Mani, Liam Paull
-
[ Visit Poster at Spot C6 in Virtual World ]

Reinforcement learning techniques are often used to model and analyze the behavior of sports teams and players. However, learning these models from observed data is challenging. The data is very sparse and does not include the intended end location of actions which are needed to model decision making. Evaluating the learned models is also extremely difficult as no ground truth is available. In this work, we propose an approach that addresses these challenges when learning a Markov model of professional soccer matches from event stream data. We apply a combination of predictive modelling and domain knowledge to obtain the intended end locations of actions and learn the transition model using a Bayesian approach to resolve sparsity issues. We provide intermediate evaluations as well as an approach to evaluate the final model. Finally, we show the model's usefulness in practice for both evaluating and rating players' decision making using data from the 17/18 and 18/19 English Premier League seasons.
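
The Bayesian treatment of sparse transition counts can be illustrated with the simplest case: a symmetric Dirichlet prior whose posterior mean shrinks rarely observed rows toward uniform. The paper's model is richer; the prior strength alpha here is an illustrative assumption:

```python
import numpy as np

def posterior_transition_probs(counts, alpha=1.0):
    """Posterior-mean transition probabilities under a symmetric
    Dirichlet(alpha) prior: (count + alpha) / (total + alpha * K).
    Sparse rows fall back toward the uniform prior instead of 0/1 extremes."""
    counts = np.asarray(counts, dtype=float)
    k = counts.shape[-1]
    return (counts + alpha) / (counts.sum(axis=-1, keepdims=True) + alpha * k)

# From a given pitch zone, 3 possible next zones; only 2 events observed.
sparse_row = posterior_transition_probs([2, 0, 0])
# A well-observed row is barely affected by the prior.
dense_row = posterior_transition_probs([200, 50, 50])
```

This is why unseen transitions get small but nonzero probability instead of being ruled out by a handful of observed events.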

Maaike Van Roy, Pieter Robberechts, Wen-Chi Yang, Luc De Raedt, Jesse Davis
-
[ Visit Poster at Spot C4 in Virtual World ]

State-of-the-art deep Q-learning methods update Q-values using state transition tuples sampled from the experience replay buffer. Often the strategy is to sample uniformly at random or to prioritize sampling based on measures such as the temporal difference (TD) error. Such sampling strategies are agnostic to the structure of the Markov decision process (MDP) and can therefore be data-inefficient at propagating reward signals from goal states to the initial state. To accelerate reward propagation, we exploit the MDP structure by organizing the agent's experience into a graph. Each edge in the graph represents a transition between two connected states. We perform value backups via a breadth-first search that expands vertices in the graph starting from the set of terminal states, successively moving backward. We empirically show that our method is substantially more data-efficient than several baselines on a diverse range of sparse-reward tasks. Notably, the proposed method also outperforms baselines that have the advantage of a much larger computational budget.
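
On a deterministic toy MDP, the backward breadth-first backup looks like this. The chain graph, discount, and zero step rewards are illustrative assumptions; the point is that the terminal reward reaches the initial state in a single sweep:

```python
from collections import deque

# Deterministic toy MDP as a transition graph: edges (s, a) -> s'.
transitions = {
    ("s0", "a0"): "s1",
    ("s1", "a0"): "s2",
    ("s2", "a0"): "goal",
}
terminal_reward = {"goal": 1.0}
gamma = 0.9

def backward_value_backup():
    """Value backups via breadth-first search from the terminal states:
    values propagate backward along reverse edges in one sweep, instead of
    trickling back over many randomly sampled replay updates."""
    # Reverse adjacency: s' -> list of predecessor states.
    reverse = {}
    for (s, a), s_next in transitions.items():
        reverse.setdefault(s_next, []).append(s)

    value = dict(terminal_reward)
    queue = deque(terminal_reward)  # start BFS from terminal states
    while queue:
        s_next = queue.popleft()
        for s in reverse.get(s_next, []):
            v = gamma * value[s_next]  # deterministic, zero step reward
            if v > value.get(s, float("-inf")):
                value[s] = v
                queue.append(s)
    return value

values = backward_value_backup()
```

With uniform replay sampling, propagating the reward back three states would typically need many passes over the buffer; the graph-ordered backup does it in one.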

Zhang-Wei Hong, Tao Chen, Yen-Chen Lin, Joni Pajarinen, Pulkit Agrawal
-
[ Visit Poster at Spot B6 in Virtual World ]

Human beings, even small children, quickly become adept at figuring out how to use applications on their mobile devices. Learning to use a new app is often achieved via trial-and-error, accelerated by transfer of knowledge from past experiences with like apps. The prospect of building a smarter smartphone — one that can learn how to achieve tasks using mobile apps — is tantalizing. In this paper we explore the use of Reinforcement Learning (RL) with the goal of advancing this aspiration. We introduce an RL-based framework for learning to accomplish tasks in mobile apps. RL agents are provided with states derived from the underlying representation of on-screen elements, and rewards that are based on progress made in the task. Agents can interact with screen elements by tapping or typing. Our experimental results, over a number of mobile apps, show that RL agents can learn to accomplish multi-step tasks, as well as achieve modest generalization across different apps. More generally, we develop a platform which addresses several engineering challenges to enable an effective RL training environment. Our AppBuddy platform is compatible with OpenAI Gym and includes a suite of mobile apps and benchmark tasks that supports a diversity of RL research in the mobile app setting.

Maayan Shvo, Zhiming Hu, Rodrigo A Toro Icarte, Iqbal Mohomed, Allan Jepson, Sheila McIlraith
-
[ Visit Poster at Spot A5 in Virtual World ]

How might we design Reinforcement Learning (RL)-based recommenders that encourage aligning user trajectories with the underlying user satisfaction? Three research questions are key: (1) measuring user satisfaction, (2) combatting sparsity of satisfaction signals, and (3) adapting the training of the recommender agent to maximize satisfaction. For measurement, it has been found that surveys explicitly asking users to rate their experience with consumed items can provide valuable orthogonal information to the engagement/interaction data, acting as a proxy for the underlying user satisfaction. For sparsity, i.e., only being able to observe how satisfied users are with a tiny fraction of user-item interactions, imputation models can be useful in predicting satisfaction levels for all items users have consumed. For learning satisfying recommender policies, we postulate that reward shaping in RL recommender agents is powerful for driving satisfying user experiences. Putting everything together, we propose to jointly learn a policy network and a satisfaction imputation network: The role of the imputation network is to learn which actions are satisfying to the user; while the policy network, built on top of REINFORCE, decides which items to recommend, with the reward utilizing the imputed satisfaction. We use both offline analysis and live experiments in an industrial large-scale recommendation platform to demonstrate the promise of our approach for satisfying user experiences.

Konstantina Christakopoulou, Can Xu, Sai Zhang, Sriraj Badam, Daniel Li, Hao Wan, Xinyang Yi, Ya Le, Chris Berg, Eric Bencomo Dixon, Ed Chi, Minmin Chen
-
[ Visit Poster at Spot A3 in Virtual World ]

Meta reinforcement learning (meta-RL) aims to learn a policy solving a set of training tasks simultaneously and quickly adapting to new tasks. It requires massive amounts of data drawn from training tasks to infer the common structure shared among tasks. Without heavy reward engineering, the sparse rewards in long-horizon tasks exacerbate the problem of sample efficiency in meta-RL. Another challenge in meta-RL is the discrepancy of difficulty level among tasks, which might cause one easy task to dominate learning of the shared policy and thus preclude policy adaptation to new tasks. In this work, we introduce a novel objective function to learn an action translator among training tasks. We theoretically verify that the value of the transferred policy with the action translator can be close to the value of the source policy. We propose to combine the action translator with context-based meta-RL algorithms for better data collection and more efficient exploration during meta-training. Our approach of policy transfer empirically improves the sample efficiency and performance of meta-RL algorithms on sparse-reward tasks.

Yijie Guo, Qiucheng Wu, Honglak Lee
-
[ Visit Poster at Spot A0 in Virtual World ]

A key challenge to deploying reinforcement learning in practice is exploring safely. We propose a natural safety property---uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. This property formalizes the idea that we should spread out exploration to avoid taking actions significantly worse than the ones that are currently known to be good. We then design an algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to ensure safety with high probability. To ensure exploration across the entire state space, it adaptively determines when to explore (at different points of time across different episodes) in a way that allows "stitching" sub-episodes together to obtain a meta-episode that is equivalent to using UCB for the entire episode. Then, we establish reasonable assumptions about the underlying MDP under which our algorithm is guaranteed to achieve sublinear regret while ensuring safety; under these assumptions, the cost of imposing safety is only a constant factor.

Wanqiao Xu, Kan Xu, Hamsa Bastani, Osbert Bastani
-
[ Visit Poster at Spot C0 in Virtual World ]

On July 1st, 2020, members of the European Union lifted earlier COVID-19 restrictions on non-essential travel. In response, we designed and deployed "Eva" – a novel bandit algorithm – across all Greek borders to identify asymptomatic travelers infected with SARS-CoV-2 based on demographic characteristics and results from previously tested travelers. Eva allocates Greece's limited testing resources to (i) limit the importation of new cases and (ii) provide real-time estimates of COVID-19 prevalence to inform border policies. Counterfactual analysis shows that our system identified on average 1.85x as many asymptomatic, infected travelers as random testing, and up to 2-4x as many during peak travel. For most countries, Eva identified atypically high prevalence 9 days earlier than machine learning systems based on public data. By adaptively adjusting border policies 9 days earlier, Eva prevented additional infected travelers from arriving.

Hamsa Bastani, Kimon Drakopoulos, Vishal Gupta
-
[ Visit Poster at Spot B2 in Virtual World ]

Reinforcement learning is difficult to apply to real world problems due to high sample complexity, the need to adapt to regular distribution shifts often encountered in the real world, and the complexities of learning from high-dimensional inputs, such as images. Over the last several years meta-learning has emerged as a promising approach to tackle these problems by explicitly training an agent to quickly adapt to novel tasks. However, such methods still require huge amounts of data during training and are difficult to optimize in high-dimensional domains. One potential solution is to consider offline or batch meta-learning - learning from existing datasets without additional environment interactions during training. In this work we develop the first offline meta-learning algorithm that operates from images in tasks with sparse rewards. Our approach has three main components: a novel strategy for constructing meta-exploration trajectories from offline data, deep variational filter training, and latent offline model-free policy optimization. We show that our method completely solves a realistic meta-learning task involving robot manipulation, while naive combinations of meta-learning and offline algorithms significantly under-perform.

Rafael Rafailov, Varun Kumar, Tianhe (Kevin) Yu, Avi Singh, mariano phielipp, Chelsea Finn
-
[ Visit Poster at Spot A4 in Virtual World ]

A key aspect of human intelligence is the ability to convey knowledge to others in succinct forms. However, current machine learning models are largely blackboxes that are hard for humans to learn from. Focusing on sequential decision-making, we design a novel machine learning algorithm that is capable of conveying its insights to humans in the form of interpretable "tips". Our algorithm selects the tip that best bridges the gap between the actions taken by the human users and those taken by the optimal policy in a way that accounts for which actions are consequential for achieving higher performance. We evaluate our approach through a series of randomized controlled user studies where participants manage a virtual kitchen. Our experiments show that the tips generated by our algorithm can significantly improve human performance. In addition, we discuss a number of empirical insights that can help inform the design of algorithms intended for human-AI collaboration. For instance, we find evidence that participants do not simply blindly follow our tips; instead, they combine them with their own experience to discover additional strategies for improving performance.

Hamsa Bastani, Osbert Bastani, Park Sinchaisri
-
[ Visit Poster at Spot D6 in Virtual World ]

Offline policy optimization has a critical impact on many real-world decision-making problems, as online learning is costly and concerning in many applications. Importance sampling and its variants are a widely used type of estimator in offline policy evaluation, which can be helpful to remove dependence on the chosen function approximations used to represent value functions and process models. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, and propose an algorithm to avoid this overfitting. We provide a theoretical justification of the proposed algorithm through a better per-state-neighborhood normalization condition and show the limitation of previous attempts at this approach through an illustrative example. We further test our proposed method in a healthcare-inspired simulator and on a logged dataset collected from real hospitals. These experiments show that the proposed method overfits less and achieves better test performance than state-of-the-art batch reinforcement learning algorithms.

Yao Liu, Emma Brunskill
-
[ Visit Poster at Spot B5 in Virtual World ]

Pivot-based neural machine translation (NMT) is commonly used in low-resource setups, especially for translation between non-English language pairs. It benefits from high-resource source-to-pivot and pivot-to-target language pairs, with an individual system trained for each sub-task. However, these models have no connection during training, and the source-to-pivot model is not optimized to produce the best translation for the source-to-target task. In this work, we propose to train a pivot-based NMT system with a reinforcement learning (RL) approach, which has been investigated for various text generation tasks, including machine translation (MT). We utilize a non-autoregressive transformer and present an end-to-end pivot-based integrated model, enabling training on source-to-target data.

Evgeniia Tokarchuk, Jan Rosendahl, Weiyue Wang, Pavel Petrushkov, Tomer Lancewicki, Shahram Khadivi, Hermann Ney
-
[ Visit Poster at Spot B5 in Virtual World ]

We apply reinforcement learning to solve the personalized post-discharge intervention problem. The ultimate goal is to reduce the 30-day hospital readmission rate under possible budget constraints. To deal with the small sample size in each patient class for personalized intervention policies, we develop a new data-pooling estimator and the corresponding data-pooling RLSVI reinforcement learning algorithm. We establish a theoretical performance guarantee for this data-pooling RLSVI algorithm and demonstrate its empirical success on a real hospital dataset.

Xinyun Chen, Pengyi Shi
-
[ Visit Poster at Spot B3 in Virtual World ]

This paper studies offline Imitation Learning (IL), where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next-state transition triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve the offline IL problem efficiently both in theory and in practice. In theory, even if the behavior policy is highly sub-optimal compared to the expert, we show that as long as the data from the behavior policy provides sufficient coverage of the expert's state-action traces (with no need for global coverage over the entire state-action space), MILO can provably combat the covariate shift issue in IL. Complementing our theoretical results, we also demonstrate that a practical implementation of our approach mitigates covariate shift on benchmark MuJoCo continuous control tasks. With behavior policies whose performance is less than half that of the expert, MILO still successfully imitates with an extremely small number of expert state-action pairs, while traditional offline IL methods such as behavior cloning (BC) fail completely.

Jonathan Chang, Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, Wen Sun
-
[ Visit Poster at Spot B5 in Virtual World ]

This paper studies Imitation Learning from Observations alone (ILFO) where the learner is presented with expert demonstrations that consist only of states visited by an expert (without access to actions taken by the expert). We present a provably efficient model-based framework MobILE to solve the ILFO problem. MobILE involves carefully trading off strategic exploration against imitation - this is achieved by integrating the idea of optimism in the face of uncertainty into the distribution matching imitation learning (IL) framework. We provide a unified analysis for MobILE, and demonstrate that MobILE enjoys strong performance guarantees for classes of MDP dynamics that satisfy certain well studied notions of structural complexity. We also show that the ILFO problem is strictly harder than the standard IL problem by presenting an exponential sample complexity separation between IL and ILFO. We complement these theoretical results with experimental simulations on benchmark OpenAI Gym tasks that indicate the efficacy of MobILE.

Rahul Kidambi, Jonathan Chang, Wen Sun
-
[ Visit Poster at Spot A6 in Virtual World ]

We study objective robustness failures, a type of out-of-distribution robustness failure in reinforcement learning (RL). Objective robustness failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong objective. This kind of failure presents different risks than the robustness problems usually considered in the literature, since it involves agents that leverage their capabilities to pursue the wrong objective rather than simply failing to do anything useful. We provide the first explicit empirical demonstrations of objective robustness failures and present a partial characterization of their causes.

Lauro Langosco di Langosco, Lee Sharkey
-
[ Visit Poster at Spot A5 in Virtual World ]

Reinforcement learning (RL) for continuous control typically employs distributions whose support covers the entire action space. In this work, we investigate the colloquially known phenomenon that trained agents often prefer actions at the boundaries of that space. We draw theoretical connections to the emergence of bang-bang behavior in optimal control, and provide extensive empirical evaluation across a variety of recent RL algorithms. We replace the usual Gaussian with a Bernoulli distribution that considers only the extremes along each action dimension - a bang-bang controller. Surprisingly, this achieves state-of-the-art performance on several continuous control benchmarks - in contrast to robotic hardware, where energy and maintenance costs affect controller choices. To reduce the impact of exploration on our analysis, we provide additional imitation learning experiments. Finally, we show that our observations extend to environments that aim to model real-world challenges, and we evaluate factors that mitigate the emergence of bang-bang solutions. Our findings emphasise challenges for benchmarking continuous control algorithms, particularly in light of real-world applications.

Tim Seyde, Igor Gilitschenski, Wilko Schwarting, Bartolomeo Stellato, Martin Riedmiller, Markus Wulfmeier, Daniela Rus
-
[ Visit Poster at Spot B4 in Virtual World ]

Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We expand its applicability by developing an OPE method for a class of stochastic and deterministic logging policies. This class includes deterministic bandit algorithms (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon targeting policies of a major online platform and show how to improve on the existing policy.

Kyohei Okumura, Yusuke Narita, Kohei Yata, Akihiro Shimizu
-
[ Visit Poster at Spot D2 in Virtual World ]

Multi-agent control problems constitute an interesting area of application for deep reinforcement learning models with continuous action spaces. Such real-world applications, however, typically come with critical safety constraints that must not be violated. In order to ensure safety, we enhance the well-known multi-agent deep deterministic policy gradient (MADDPG) framework by adding a safety layer to the deep policy network. In particular, we extend the idea of linearizing the single-step transition dynamics, as was done for single-agent systems in Safe DDPG (Dalal et al., 2018), to multi-agent settings. We additionally propose to circumvent infeasibility problems in the action correction step using soft constraints (Kerrigan & Maciejowski, 2000). Results from the theory of exact penalty functions can be used to guarantee constraint satisfaction of the soft constraints under mild assumptions. We empirically find that the soft formulation achieves a dramatic decrease in constraint violations, making safety available even during the learning procedure.

Athina Nisioti, Dario Pavllo, Jonas Kohler
-
[ Visit Poster at Spot B3 in Virtual World ]

We present the first application of reinforcement learning in the materials discovery domain that explicitly considers the logical structure of the interactions between the RL agent and the environment. Here, the environment is defined as the space of experiments accessible via a realistic experimental platform. We explicitly pursue training of generalizable agents that learn to navigate an abstract space of experiments relevant to materials preparation. The training is facilitated by a data-augmentation strategy that recycles a moderate volume of real experimental data. Experiments show that the agent can successfully search for the experiments needed to produce materials with the desired properties and characteristics. Furthermore, the agent learns to avoid proposing experiments that would result in undesired materials; for example, it avoids a cross-linked form of a polymer when cross-linking should be avoided.

Sarath Swaminathan, Dmitry Zubarev, Subhajit Chaudhury, Asim Munawar
-
[ Visit Poster at Spot A3 in Virtual World ]

We study the problem of safe offline reinforcement learning (RL), in which the goal is to learn a policy that maximizes long-term reward while satisfying safety constraints, given only offline data and no further interaction with the environment. This setting is appealing for real-world RL applications in which data collection is costly or dangerous. Enforcing constraint satisfaction is non-trivial, especially in offline settings, as there can be a large discrepancy between the policy distribution and the data distribution, causing errors in estimating the values of the safety constraints. We show that naïve approaches combining techniques from safe RL and offline RL can learn only sub-optimal solutions. We thus develop a simple yet effective algorithm, Constraints Penalized Q-Learning (CPQ), to solve the problem. Our method admits the use of data generated by mixed behavior policies. We present a theoretical analysis and demonstrate empirically that our approach can learn robustly across a variety of benchmark control tasks, outperforming several baselines.

Haoran Xu, Xianyuan Zhan, Xiangyu Zhu
-
[ Visit Poster at Spot D0 in Virtual World ]

Traffic signal control is of critical importance for the effective use of transportation infrastructure. Unfortunately, the rapid increase in different types of vehicles makes traffic signal control more and more challenging. Reinforcement Learning (RL) based algorithms have demonstrated their potential in dealing with traffic signal control. However, most existing solutions require a large amount of training data, which is unacceptable for many real-world scenarios. This paper proposes a novel model-based meta-reinforcement learning framework (ModelLight) for traffic signal control. Within ModelLight, an ensemble of models for road intersections and an optimization-based meta-learning method are used to improve the data efficiency of an RL-based traffic light control method. Experiments on real-world datasets demonstrate that the proposed ModelLight can outperform state-of-the-art traffic light control algorithms while substantially reducing the required interactions with the real-world environment.

Xingshuai Huang, Di Wu, Benoit Boulet
-
[ Visit Poster at Spot C3 in Virtual World ]

Trading markets are a real-world financial application for deploying reinforcement learning agents; however, they carry hard fundamental challenges such as high variance and costly exploration. Moreover, markets are inherently a multi-agent domain composed of many actors taking actions and changing the environment. To tackle these types of scenarios, agents need to exhibit certain characteristics such as \emph{risk-awareness}, \emph{robustness to perturbations}, and \emph{low learning variance}. We take those as building blocks and propose a family of four algorithms. First, we contribute two algorithms that use risk-averse objective functions and variance reduction techniques. Then, we extend the framework to multi-agent learning and assume an adversary that can take over and perturb the learning process. Our third and fourth algorithms perform well under this setting and balance theoretical guarantees with practical use. Additionally, we consider the multi-agent nature of the environment; to the best of our knowledge, ours is the first work to extend empirical game-theoretic analysis to multi-agent learning with risk-sensitive payoffs.

Yue Gao, Pablo Hernandez-Leal, Kry Yik Chau Lui
-
[ Visit Poster at Spot B1 in Virtual World ]

As global demand for electricity increases, operating power networks has become more complex. Power network operation can be posed as a reinforcement learning (RL) task, and there is increasing interest in developing RL agents that can automate operation. The Learning To Run Power Network (L2RPN) environment models a real-world electric grid, posing as a test bed for these RL agents. Agents must be robust, i.e., ensure reliable electricity flow even when some power lines are disconnected. Because of the large state and action space of power grids, robustness is hard to achieve and has become a key technical obstacle in widespread adoption of RL for power networks. To improve the robustness of L2RPN agents, we propose adversarial training. We make the following contributions: 1) we design an agent-specific \emph{adversary MDP} to train an adversary that minimizes a given agent's reward; 2) we demonstrate the potency of our adversarial policies against winning agent policies from the L2RPN challenge; 3) we improve the robustness of a winning L2RPN agent by adversarially training it against our learned adversary. To the best of our knowledge, we provide the first evidence that learned adversaries for power network agents are potent. We also demonstrate a novel, real-world application of adversarial training: improving the robustness of RL agents for power networks.
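The agent-specific adversary MDP sketched above amounts to training against a wrapped environment in which the adversary acts, the frozen protagonist agent responds, and the adversary receives the negated reward. A minimal sketch under stated assumptions; `apply_attack` is a hypothetical hook, not the L2RPN API:

```python
class AdversaryEnv:
    """Wraps a grid environment so that an adversary (e.g., one that
    disconnects power lines) is trained to minimize a fixed protagonist
    agent's reward. A sketch, assuming the wrapped env exposes an
    attack hook; the real L2RPN interface differs."""

    def __init__(self, env, agent_policy):
        self.env = env
        self.agent = agent_policy  # frozen protagonist policy

    def step(self, adversary_action):
        obs = self.env.apply_attack(adversary_action)   # hypothetical hook
        agent_action = self.agent(obs)                  # protagonist reacts
        obs, reward, done, info = self.env.step(agent_action)
        return obs, -reward, done, info                 # adversary's reward
```

Any standard RL algorithm run on `AdversaryEnv` then learns attacks tailored to that particular agent, which is what makes adversarial retraining of the agent possible afterwards.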

Alex Pan, Yongkyun Lee, Huan Zhang
-
[ Visit Poster at Spot D5 in Virtual World ]

Deep Reinforcement Learning (RL) agents have achieved superhuman performance on several video game suites. However, unlike humans, the trained policies fail to transfer between related games or even between different levels of the same game. Recent works have shown that ideas such as data augmentation and learning domain-invariant features can reduce this generalization gap. However, transfer performance still remains unsatisfactory. In this work we use procedurally generated video games to empirically investigate several hypotheses for the lack of transfer. Contrary to the belief that a lack of generalizable visual features results in poor policy generalization, we find that visual features do transfer across levels, but the inability to use these features to predict actions in new levels limits the overall transfer. We also show that simple auxiliary tasks can improve generalization and lead to policies that transfer as well as state-of-the-art methods using data augmentation. Finally, to inform fruitful avenues for future research, we construct simple oracle methods that close the generalization gap.

Anurag Ajay, Ge Yang, Ofir Nachum, Pulkit Agrawal
-
[ Visit Poster at Spot D2 in Virtual World ]

Management of chronic diseases such as diabetes mellitus requires adaptation of treatment regimes based on patient characteristics and response. There is no single treatment that fits all patients in all contexts; moreover, the set of admissible treatments usually varies over the course of the disease. In this paper, we address the problem of optimizing treatment regimes under time-varying constraints by using volatile contextual Gaussian process bandits. In particular, we propose a variant of GP-UCB with volatile arms, which takes into account the patient's context together with the set of admissible treatments when recommending new treatments. Our Bayesian approach is able to provide treatment recommendations to the patients along with confidence bounds which can be used for risk assessment. We use our algorithm to recommend bolus insulin doses for type 1 diabetes mellitus patients. Simulation studies show that our algorithm compares favorably with traditional blood glucose regulation methods.
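A minimal numpy sketch of GP-UCB selection over a volatile (round-dependent) arm set: the Gaussian-process posterior is fit on past context-treatment observations, and each round the arm with the highest upper confidence bound among the currently admissible arms is recommended. The RBF kernel, `beta`, and noise values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    """RBF kernel matrix between two sets of points (rows)."""
    d = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
         - 2.0 * X1 @ X2.T)
    return np.exp(-0.5 * d / ls**2)

def gp_ucb_select(X_obs, y_obs, X_admissible, beta=2.0, noise=1e-2):
    """Score every currently admissible arm with mu + sqrt(beta) * sigma
    from the GP posterior; X_admissible may change every round (the
    'volatile arms' setting). Returns the chosen index plus the posterior
    mean and standard deviation, usable for risk assessment."""
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_inv = np.linalg.inv(K)
    k_star = rbf(X_obs, X_admissible)
    mu = k_star.T @ K_inv @ y_obs
    var = np.clip(1.0 - np.sum(k_star * (K_inv @ k_star), axis=0), 0.0, None)
    sigma = np.sqrt(var)
    ucb = mu + np.sqrt(beta) * sigma
    return int(np.argmax(ucb)), mu, sigma
```

With one observed arm and one far-away unexplored arm, the unexplored arm's large posterior uncertainty dominates its UCB score, illustrating the exploration bonus.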

Ahmet Alparslan Celik, Cem Tekin
-
[ Visit Poster at Spot C2 in Virtual World ]

Reinforcement Learning (RL) algorithms can in principle acquire complex robotic skills by learning from large amounts of data in the real world, collected via trial and error. However, most RL algorithms use a carefully engineered setup in order to collect data, requiring human supervision and intervention to provide episodic resets. This is particularly evident in challenging robotics problems, such as dexterous manipulation. To make data collection scalable, such applications require reset-free algorithms that are able to learn autonomously, without explicit instrumentation or human intervention. Most prior work in this area handles single-task learning. However, we might also want robots that can perform large repertoires of skills. At first, this would appear to only make the problem harder. However, the key observation we make in this work is that an appropriately chosen multi-task RL setting actually alleviates the reset-free learning challenge, with minimal additional machinery required. In effect, solving a multi-task problem can directly solve the reset-free problem since different combinations of tasks can serve to perform resets for other tasks. By learning multiple tasks together and appropriately sequencing them, we can effectively learn all of the tasks together reset-free. This type of multi-task learning can effectively scale reset-free learning schemes to much more complex problems, as we demonstrate in our experiments. We propose a simple scheme for multi-task learning that tackles the reset-free learning problem, and show its effectiveness at learning to solve complex dexterous manipulation tasks in both hardware and simulation without any explicit resets. This work shows the ability to learn dexterous manipulation behaviors in the real world with RL without any human intervention.

Abhishek Gupta, Justin Yu, Tony Z. Zhao, Vikash Kumar, Aaron Rovinsky, Kelvin Xu, Thomas Devlin, Sergey Levine
-
[ Visit Poster at Spot D0 in Virtual World ]

This paper tackles the bin packing problem (BPP) from a learning perspective. Building on self-attention-based encoding and deep reinforcement learning algorithms, we propose a new end-to-end learning model for this task. By decomposing the combinatorial action space, and by utilizing a new training technique denoted prioritized oversampling - a general scheme to speed up on-policy learning - we achieve state-of-the-art performance in a range of experimental settings. Moreover, although the proposed approach, attend2pack, targets offline BPP, we strip our method down to the strict online BPP setting, where it is also able to achieve state-of-the-art performance. With a set of ablation studies and comparisons against a range of previous works, we hope to offer a valid baseline approach for this field of study.

Jingwei Zhang, Bin Zi, Xiaoyu Ge
-
[ Visit Poster at Spot A5 in Virtual World ]

In an ever-expanding set of research and application areas, deep neural networks (DNNs) set the bar for algorithm performance. However, depending upon additional constraints such as processing power and execution time limits, or requirements such as verifiable safety guarantees, it may not be feasible to actually use such high-performing DNNs in practice. Many techniques have been developed in recent years to compress or distill complex DNNs into smaller, faster or more understandable models and controllers. This work seeks to identify reduced models that not only preserve a desired performance level, but also, for example, succinctly explain the latent knowledge represented by a DNN. We illustrate the effectiveness of the proposed approach on the evaluation of decision tree variants and kernel machines in the context of benchmark reinforcement learning tasks.

Nathan Dahlin, Rahul Jain, Pierluigi Nuzzo, Krishna Kalagarla, Nikhil Naik
-
[ Visit Poster at Spot C3 in Virtual World ]

We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite the simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
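The return-conditioning idea reduces to feeding the model interleaved (return-to-go, state, action) tokens. A sketch of that sequence construction (the transformer itself is omitted; the tuple-based token encoding is an illustrative simplification):

```python
import numpy as np

def returns_to_go(rewards):
    """Suffix sums of the reward sequence: the return the agent should
    still achieve from each timestep onward."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]

def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples into the flat
    token sequence a Decision Transformer is trained on autoregressively."""
    rtg = returns_to_go(rewards)
    seq = []
    for g, s, a in zip(rtg, states, actions):
        seq.extend([("rtg", float(g)), ("state", s), ("action", a)])
    return seq
```

At test time, setting the first return-to-go token to a desired target return conditions the model to generate actions that try to achieve it.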

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch
-
[ Visit Poster at Spot A2 in Virtual World ]

Offline reinforcement learning (RL) algorithms have shown promising results in domains where abundant pre-collected data is available. However, prior methods focus on solving individual problems from scratch with an offline dataset without considering how an offline RL agent can acquire multiple skills. We argue that a natural use case of offline RL is in settings where we can pool large amounts of data collected in a number of different scenarios for solving various tasks, and utilize all this data to learn strategies for all the tasks more effectively rather than training each one in isolation. To this end, we study the offline multi-task RL problem, with the goal of devising data-sharing strategies for effectively learning behaviors across all of the tasks. While it is possible to share all data across all tasks, we find that this simple strategy can actually exacerbate the distributional shift between the learned policy and the dataset, which in turn can lead to very poor performance. To address this challenge, we develop a simple technique for data-sharing in multi-task offline RL that routes data based on the improvement over the task-specific data. We call this approach conservative data sharing (CDS), and it can be applied with any single-task offline RL method. On a range of challenging multi-task locomotion, navigation, and image-based robotic manipulation problems, CDS achieves the best or comparable performance compared to prior offline multi-task RL methods and previously proposed online multi-task data sharing approaches.

Tianhe (Kevin) Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Sergey Levine, Chelsea Finn
-
[ Visit Poster at Spot B3 in Virtual World ]

While deep reinforcement learning (RL) methods present an appealing approach to sequential decision-making, such methods are often unstable in practice. What accounts for this instability? Recent theoretical analysis of overparameterized supervised learning with stochastic gradient descent shows that learning is driven by an implicit regularizer, which results in simpler functions that generalize despite overparameterization. However, in this paper, we show that in the case of deep RL, this very same implicit regularization can instead lead to degenerate features. Specifically, features learned by the value function at state-action tuples appearing on both sides of the Bellman update "co-adapt" to each other, giving rise to poor solutions. We characterize the nature of this implicit regularizer in temporal difference learning algorithms and show that this regularizer recapitulates recent empirical findings regarding the rank collapse of learned features and provides an understanding for its cause. To address the adverse impacts of this implicit regularization, we propose a simple and effective explicit regularizer, DR3. DR3 minimizes the similarity of learned features of the Q-network at consecutive state-action tuples in the TD update. Empirically, when combined with existing offline RL methods, DR3 substantially improves both performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images.
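DR3's explicit regularizer is essentially the dot-product similarity between the Q-network's learned features at consecutive state-action tuples in the TD update. A numpy sketch; the batch shapes and the coefficient in the usage comment are illustrative assumptions:

```python
import numpy as np

def dr3_penalty(phi_sa, phi_next_sa):
    """Mean dot product between penultimate-layer features phi(s, a) and
    phi(s', a') for a batch of TD transitions (rows = transitions).
    Minimizing this discourages the feature co-adaptation described above."""
    return float(np.mean(np.sum(phi_sa * phi_next_sa, axis=-1)))

# Illustrative use inside a TD loss (c0 is a hypothetical coefficient):
# total_loss = td_loss + c0 * dr3_penalty(phi_sa, phi_next_sa)
```

Orthogonal feature pairs contribute nothing to the penalty, while co-adapted (aligned) pairs are penalized, which is the intended effect.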

Aviral Kumar, Rishabh Agarwal, Aaron Courville, Tengyu Ma, George Tucker, Sergey Levine
-
[ Visit Poster at Spot B0 in Virtual World ]

Although well-established in general reinforcement learning (RL), value-based methods are rarely explored in constrained RL (CRL) for their incapability of finding policies that can randomize among multiple actions. To apply value-based methods to CRL, a recent groundbreaking line of game-theoretic approaches uses the mixed policy that randomizes among a set of carefully generated policies to converge to the desired constraint-satisfying policy. However, these approaches require storing a large set of policies, which is not policy efficient, and may incur prohibitive memory costs in large-scale applications. To address this problem, we propose an alternative approach. Our approach first reformulates the CRL problem to an equivalent distance optimization problem. With a specially designed linear optimization oracle, we derive a meta-algorithm that solves it using any off-the-shelf RL algorithm and any conditional gradient (CG) type algorithm as subroutines. We then propose a new variant of the CG-type algorithm, which generalizes the minimum norm point (MNP) method. The proposed method matches the convergence rate of the existing game-theoretic approaches and achieves the worst-case optimal policy efficiency. The experiments on a navigation task show that our method reduces the memory costs by an order of magnitude, and meanwhile achieves better performance, demonstrating both its effectiveness and efficiency.

Tianchi Cai, Wenpeng Zhang, Lihong Gu, Xiaodong Zeng, Jinjie Gu
-
[ Visit Poster at Spot C2 in Virtual World ]

Imitation learning has proven effective in mimicking experts' behaviors from their demonstrations, without access to explicit reward signals. Meanwhile, complex tasks, e.g., dynamic treatment regimes for patients with comorbidities, often exhibit significant variability in expert demonstrations across multiple sub-tasks. In these cases, it can be difficult to use a single flat policy to handle tasks with hierarchical structure. In this paper, we propose a hierarchical imitation learning model, HIL, to jointly learn latent high-level policies and sub-policies (for individual sub-tasks) from expert demonstrations without prior knowledge. First, HIL learns sub-policies by imitating expert trajectories with sub-task switching guidance from the high-level policies. Second, HIL collects feedback from its sub-policies to optimize the high-level policies, which are modeled as a contextual multi-armed bandit that sequentially selects the best sub-policy at each time step based on contextual information derived from the demonstrations. Compared with state-of-the-art baselines on real-world medical data, HIL improves the likelihood of patient survival and provides better dynamic treatment regimes by exploiting the hierarchical structure in expert demonstrations.
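The high-level learner can be sketched as a bandit over sub-policies; for brevity this drops the context and uses plain UCB, so it is a simplification of HIL's contextual formulation, with all names illustrative:

```python
import numpy as np

def ucb_subpolicy(counts, rewards, t, c=2.0):
    """Pick the sub-policy index with the best upper confidence bound.
    counts[i] = times sub-policy i was selected; rewards[i] = its total
    feedback so far; t = current round. Untried arms get priority."""
    means = rewards / np.maximum(counts, 1)
    bonus = np.sqrt(c * np.log(t + 1) / np.maximum(counts, 1))
    bonus[counts == 0] = np.inf  # force each sub-policy to be tried once
    return int(np.argmax(means + bonus))
```

In HIL proper, the selection additionally conditions on the contextual features derived from the demonstration at the current time step.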

Lu Wang, Wenchao Yu, Wei Cheng, Bo Zong, Haifeng Chen
-
[ Visit Poster at Spot A5 in Virtual World ]

Reinforcement Learning (RL) is a promising approach for solving various control, optimization, and sequential decision making tasks. However, designing reward functions for complex tasks (e.g., with multiple objectives and safety constraints) can be challenging for most users and usually requires multiple expensive trials (reward function “hacking”). In this paper we propose a specification language (Inkling Goal Specification) for complex control and optimization tasks, which is very close to natural language and allows a practitioner to focus on problem specification instead of reward function hacking. The core elements of our framework are: (i) mapping the high level language to a predicate temporal logic tailored to control and optimization tasks, (ii) a novel automaton-guided dense reward generation that can be used to drive RL algorithms, and (iii) a set of performance metrics to assess the behavior of the system. We include a set of experiments showing that the proposed method provides great ease of use to specify a wide range of real world tasks; and that the reward generated is able to drive the policy training to achieve the specified goal.

Xuan Zhao
-
[ Visit Poster at Spot B3 in Virtual World ]

The ability to autonomously learn behaviors via direct interactions in uninstrumented environments can lead to generalist robots capable of enhancing productivity or providing care in unstructured settings like homes. Such uninstrumented settings warrant operation using only the robot's onboard sensors, such as cameras and joint encoders, which can be challenging for policy learning owing to high dimensionality and partial observability. We propose RRL: Resnet as representation for Reinforcement Learning - a straightforward yet effective approach that can learn complex behaviors directly from such inputs. RRL fuses features extracted from a pre-trained Resnet into the standard reinforcement learning pipeline and delivers results comparable to learning directly from the state. On a simulated dexterous manipulation benchmark, where state-of-the-art methods fail to make significant progress, RRL delivers contact-rich behaviors. The appeal of RRL lies in its simplicity in bringing together progress from the fields of Representation Learning, Imitation Learning, and Reinforcement Learning. Its effectiveness in learning behaviors directly from visual inputs, with performance and sample efficiency matching learning directly from the state, even in complex high-dimensional domains, is far from obvious.

Rutav Shah, Vikash Kumar
-
The MineRL Competitions at NeurIPS 2021 (Poster) [ Visit Poster at Spot B2 in Virtual World ]
Cody Wild, Stephanie Milani
-
IGLU: Interactive Grounded Language Understanding in a Collaborative Environment (Poster) [ Visit Poster at Spot B4 in Virtual World ]
Julia Kiseleva

Author Information

Yuxi Li (attain.ai)
Minmin Chen (Google research)
Omer Gottesman (Harvard University)
Lihong Li (Amazon)
Zongqing Lu (Peking University)
Rupam Mahmood (University of Alberta)
Niranjani Prasad (Microsoft Research Cambridge)
Zhiwei (Tony) Qin (Didi Research America)
Csaba Szepesvari (Deepmind)
Matthew Taylor (U. of Alberta)
