Timezone: »
We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game framing of offline RL: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
Author Information
Ching-An Cheng (Microsoft Research)
Tengyang Xie (University of Illinois at Urbana-Champaign)
Nan Jiang (University of Illinois at Urbana-Champaign)
Alekh Agarwal (Microsoft Research)
Related Events (a corresponding poster, oral, or spotlight)
-
2022 Poster: Adversarially Trained Actor Critic for Offline Reinforcement Learning »
Thu. Jul 21st through Fri the 22nd Room Hall E #914
More from the Same Authors
-
2021 : Provable RL with Exogenous Distractors via Multistep Inverse Dynamics »
Yonathan Efroni · Dipendra Misra · Akshay Krishnamurthy · Alekh Agarwal · John Langford -
2021 : A Spectral Approach to Off-Policy Evaluation for POMDPs »
Yash Nair · Nan Jiang -
2021 : Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning »
Tengyang Xie · Nan Jiang · Huan Wang · Caiming Xiong · Yu Bai -
2022 : Interaction-Grounded Learning with Action-inclusive Feedback »
Tengyang Xie · Akanksha Saran · Dylan Foster · Lekan Molu · Ida Momennejad · Nan Jiang · Paul Mineiro · John Langford -
2022 : Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions »
Audrey Huang · Nan Jiang -
2023 : Survival Instinct in Offline Reinforcement Learning and Implicit Human Bias in Data »
Anqi Li · Dipendra Misra · Andrey Kolobov · Ching-An Cheng -
2023 Workshop: Interactive Learning with Implicit Human Feedback »
Andi Peng · Akanksha Saran · Andreea Bobu · Tengyang Xie · Pierre-Yves Oudeyer · Anca Dragan · John Langford -
2023 Poster: Stochastic Gradient Succeeds for Bandits »
Jincheng Mei · Zixin Zhong · Bo Dai · Alekh Agarwal · Csaba Szepesvari · Dale Schuurmans -
2023 Poster: Learning in POMDPs is Sample-Efficient with Hindsight Observability »
Jonathan Lee · Alekh Agarwal · Christoph Dann · Tong Zhang -
2023 Poster: Offline Learning in Markov Games with General Function Approximation »
Yuheng Zhang · Yu Bai · Nan Jiang -
2023 Poster: Provable Reset-free Reinforcement Learning by No-Regret Reduction »
Hoai-An Nguyen · Ching-An Cheng -
2023 Poster: The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation »
Philip Amortila · Nan Jiang · Csaba Szepesvari -
2023 Poster: Reinforcement Learning in Low-rank MDPs with Density Features »
Audrey Huang · Jinglin Chen · Nan Jiang -
2023 Poster: MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations »
Anqi Li · Byron Boots · Ching-An Cheng -
2023 Poster: Hindsight Learning for MDPs with Exogenous Inputs »
Sean R. Sinclair · Felipe Vieira Frujeri · Ching-An Cheng · Luke Marshall · Hugo Barbalho · Jingling Li · Jennifer Neville · Ishai Menache · Adith Swaminathan -
2022 Poster: Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning approach »
Xuezhou Zhang · Yuda Song · Masatoshi Uehara · Mengdi Wang · Alekh Agarwal · Wen Sun -
2022 Spotlight: Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning approach »
Xuezhou Zhang · Yuda Song · Masatoshi Uehara · Mengdi Wang · Alekh Agarwal · Wen Sun -
2022 Poster: A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes »
Chengchun Shi · Masatoshi Uehara · Jiawei Huang · Nan Jiang -
2022 Oral: A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes »
Chengchun Shi · Masatoshi Uehara · Jiawei Huang · Nan Jiang -
2021 : RL + Recommender Systems Panel »
Alekh Agarwal · Ed Chi · Maria Dimakopoulou · Georgios Theocharous · Minmin Chen · Lihong Li -
2021 Poster: Interaction-Grounded Learning »
Tengyang Xie · John Langford · Paul Mineiro · Ida Momennejad -
2021 Poster: Provably Correct Optimization and Exploration with Non-linear Policies »
Fei Feng · Wotao Yin · Alekh Agarwal · Lin Yang -
2021 Poster: Safe Reinforcement Learning Using Advantage-Based Intervention »
Nolan Wagener · Byron Boots · Ching-An Cheng -
2021 Spotlight: Safe Reinforcement Learning Using Advantage-Based Intervention »
Nolan Wagener · Byron Boots · Ching-An Cheng -
2021 Spotlight: Provably Correct Optimization and Exploration with Non-linear Policies »
Fei Feng · Wotao Yin · Alekh Agarwal · Lin Yang -
2021 Spotlight: Interaction-Grounded Learning »
Tengyang Xie · John Langford · Paul Mineiro · Ida Momennejad -
2021 Poster: Batch Value-function Approximation with Only Realizability »
Tengyang Xie · Nan Jiang -
2021 Spotlight: Batch Value-function Approximation with Only Realizability »
Tengyang Xie · Nan Jiang -
2020 Poster: Minimax Weight and Q-Function Learning for Off-Policy Evaluation »
Masatoshi Uehara · Jiawei Huang · Nan Jiang -
2020 Poster: From Importance Sampling to Doubly Robust Policy Gradient »
Jiawei Huang · Nan Jiang -
2019 Poster: Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback »
Chicheng Zhang · Alekh Agarwal · Hal Daumé III · John Langford · Sahand Negahban -
2019 Poster: Fair Regression: Quantitative Definitions and Reduction-Based Algorithms »
Alekh Agarwal · Miroslav Dudik · Steven Wu -
2019 Oral: Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback »
Chicheng Zhang · Alekh Agarwal · Hal Daumé III · John Langford · Sahand Negahban -
2019 Oral: Fair Regression: Quantitative Definitions and Reduction-Based Algorithms »
Alekh Agarwal · Miroslav Dudik · Steven Wu -
2019 Poster: Provably efficient RL with Rich Observations via Latent State Decoding »
Simon Du · Akshay Krishnamurthy · Nan Jiang · Alekh Agarwal · Miroslav Dudik · John Langford -
2019 Poster: Information-Theoretic Considerations in Batch Reinforcement Learning »
Jinglin Chen · Nan Jiang -
2019 Oral: Information-Theoretic Considerations in Batch Reinforcement Learning »
Jinglin Chen · Nan Jiang -
2019 Oral: Provably efficient RL with Rich Observations via Latent State Decoding »
Simon Du · Akshay Krishnamurthy · Nan Jiang · Alekh Agarwal · Miroslav Dudik · John Langford -
2018 Poster: Hierarchical Imitation and Reinforcement Learning »
Hoang Le · Nan Jiang · Alekh Agarwal · Miroslav Dudik · Yisong Yue · Hal Daumé III -
2018 Poster: A Reductions Approach to Fair Classification »
Alekh Agarwal · Alina Beygelzimer · Miroslav Dudik · John Langford · Hanna Wallach -
2018 Oral: Hierarchical Imitation and Reinforcement Learning »
Hoang Le · Nan Jiang · Alekh Agarwal · Miroslav Dudik · Yisong Yue · Hal Daumé III -
2018 Oral: A Reductions Approach to Fair Classification »
Alekh Agarwal · Alina Beygelzimer · Miroslav Dudik · John Langford · Hanna Wallach -
2018 Poster: Practical Contextual Bandits with Regression Oracles »
Dylan Foster · Alekh Agarwal · Miroslav Dudik · Haipeng Luo · Robert Schapire -
2018 Oral: Practical Contextual Bandits with Regression Oracles »
Dylan Foster · Alekh Agarwal · Miroslav Dudik · Haipeng Luo · Robert Schapire -
2017 : Corralling a Band of Bandit Algorithms »
Alekh Agarwal -
2017 Poster: Contextual Decision Processes with low Bellman rank are PAC-Learnable »
Nan Jiang · Akshay Krishnamurthy · Alekh Agarwal · John Langford · Robert Schapire -
2017 Poster: Optimal and Adaptive Off-policy Evaluation in Contextual Bandits »
Yu-Xiang Wang · Alekh Agarwal · Miroslav Dudik -
2017 Talk: Contextual Decision Processes with low Bellman rank are PAC-Learnable »
Nan Jiang · Akshay Krishnamurthy · Alekh Agarwal · John Langford · Robert Schapire -
2017 Talk: Optimal and Adaptive Off-policy Evaluation in Contextual Bandits »
Yu-Xiang Wang · Alekh Agarwal · Miroslav Dudik -
2017 Poster: Active Learning for Cost-Sensitive Classification »
Akshay Krishnamurthy · Alekh Agarwal · Tzu-Kuo Huang · Hal Daumé III · John Langford -
2017 Talk: Active Learning for Cost-Sensitive Classification »
Akshay Krishnamurthy · Alekh Agarwal · Tzu-Kuo Huang · Hal Daumé III · John Langford -
2017 Tutorial: Real World Interactive Learning »
Alekh Agarwal · John Langford