We present a novel observation about the behavior of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and a certain human bias implicit in common data collection practices. As we prove in this work, pessimism endows the agent with a survival instinct, i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. We argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones. Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is "nudged" to learn a desirable behavior with imperfect reward but purposely biased data coverage.
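To make the described setup concrete, below is a minimal sketch (not the authors' code) of the kind of reward relabeling the abstract refers to: an offline dataset's rewards are zeroed, negated, or randomized before training. The `relabel_rewards` helper, the D4RL-style dictionary keys, and the suggestion to pass the result to a pessimistic offline RL algorithm are illustrative assumptions, not part of the paper.

```python
# Minimal sketch: producing "wrong" reward labels for an offline RL dataset.
# Assumes a D4RL-style dictionary of transition arrays; all names here are
# hypothetical and only illustrate the experimental setup described above.
import numpy as np

def relabel_rewards(dataset, scheme="zero"):
    """Return a copy of `dataset` with relabeled rewards.

    scheme: 'zero'   -> all rewards set to 0
            'negate' -> rewards replaced by their negatives
            'random' -> rewards drawn uniformly from the original reward range
    """
    relabeled = dict(dataset)  # shallow copy; only 'rewards' is replaced
    r = np.asarray(dataset["rewards"], dtype=np.float64)
    if scheme == "zero":
        relabeled["rewards"] = np.zeros_like(r)
    elif scheme == "negate":
        relabeled["rewards"] = -r
    elif scheme == "random":
        relabeled["rewards"] = np.random.uniform(low=r.min(), high=r.max(), size=r.shape)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return relabeled

# Toy usage; in practice the relabeled data would be fed to a pessimistic
# offline RL algorithm (e.g., CQL or ATAC) and the resulting policy compared
# against one trained on the true rewards.
toy_dataset = {
    "observations": np.zeros((5, 3)),
    "actions": np.zeros((5, 1)),
    "rewards": np.array([1.0, 0.5, -0.2, 0.8, 0.0]),
    "terminals": np.zeros(5, dtype=bool),
}
print(relabel_rewards(toy_dataset, scheme="negate")["rewards"])
```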
Author Information
Anqi Li (University of Washington)
Dipendra Misra
Andrey Kolobov (Microsoft Research)
Ching-An Cheng (Microsoft Research)
More from the Same Authors
- 2023 Poster: Provable Reset-free Reinforcement Learning by No-Regret Reduction
  Hoai-An Nguyen · Ching-An Cheng
- 2023 Poster: Principled Offline RL in the Presence of Rich Exogenous Information
  Riashat Islam · Manan Tomar · Alex Lamb · Yonathan Efroni · Hongyu Zang · Aniket Didolkar · Dipendra Misra · Xin Li · Harm Seijen · Remi Tachet des Combes · John Langford
- 2023 Poster: MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations
  Anqi Li · Byron Boots · Ching-An Cheng
- 2023 Poster: Hindsight Learning for MDPs with Exogenous Inputs
  Sean R. Sinclair · Felipe Vieira Frujeri · Ching-An Cheng · Luke Marshall · Hugo Barbalho · Jingling Li · Jennifer Neville · Ishai Menache · Adith Swaminathan
- 2022 Poster: Adversarially Trained Actor Critic for Offline Reinforcement Learning
  Ching-An Cheng · Tengyang Xie · Nan Jiang · Alekh Agarwal
- 2022 Oral: Adversarially Trained Actor Critic for Offline Reinforcement Learning
  Ching-An Cheng · Tengyang Xie · Nan Jiang · Alekh Agarwal
- 2021 Poster: Safe Reinforcement Learning Using Advantage-Based Intervention
  Nolan Wagener · Byron Boots · Ching-An Cheng
- 2021 Spotlight: Safe Reinforcement Learning Using Advantage-Based Intervention
  Nolan Wagener · Byron Boots · Ching-An Cheng
- 2020 Poster: Online Learning for Active Cache Synchronization
  Andrey Kolobov · Sebastien Bubeck · Julian Zimmert
- 2019: posters
  Zhengxing Chen · Juan Jose Garau Luis · Ignacio Albert Smet · Aditya Modi · Sabina Tomkins · Riley Simmons-Edler · Hongzi Mao · Alexander Irpan · Hao Lu · Rose Wang · Subhojyoti Mukherjee · Aniruddh Raghu · Syed Arbab Mohd Shihab · Byung Hoon Ahn · Rasool Fakoor · Pratik Chaudhari · Elena Smirnova · Min-hwan Oh · Xiaocheng Tang · Tony Qin · Qingyang Li · Marc Brittain · Ian Fox · Supratik Paul · Xiaofeng Gao · Yinlam Chow · Gabriel Dulac-Arnold · Ofir Nachum · Nikos Karampatziakis · Bharathan Balaji · Supratik Paul · Ali Davody · Djallel Bouneffouf · Himanshu Sahni · Soo Kim · Andrey Kolobov · Alexander Amini · Yao Liu · Xinshi Chen · Craig Boutilier