Conservatism has led to significant progress in offline reinforcement learning (RL), where an agent learns from a pre-collected dataset. However, since many real-world scenarios involve interaction among multiple agents, it is important to extend offline RL to the multi-agent setting. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms would also transfer directly. Surprisingly, we empirically observe that conservative offline RL algorithms do not work well in the multi-agent setting: performance degrades significantly as the number of agents increases. Towards mitigating this degradation, we identify a key issue: the non-concavity of the value function makes policy gradient improvements prone to local optima. The problem is severely exacerbated with multiple agents, since a suboptimal policy from any single agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which combines first-order policy gradients with zeroth-order optimization methods to better optimize the conservative value functions over the actor parameters. Despite its simplicity, OMAR achieves state-of-the-art results in a variety of multi-agent control tasks.
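The core mechanism described above (rectifying the actor with a zeroth-order search on top of the usual first-order policy gradient) can be sketched in a few lines. The snippet below is an illustrative approximation per agent, not the paper's exact implementation: the deterministic actor, the Gaussian candidate sampler, and the hyperparameters `num_samples`, `noise_std`, and `rect_coef` are assumptions made for this sketch.

```python
import torch


def rectified_actor_loss(actor, critic, states,
                         num_samples=16, noise_std=0.1, rect_coef=1.0):
    """Sketch of an actor-rectification-style loss (illustrative, not the paper's code).

    Combines two terms over a batch of states:
      1) first-order term: gradient ascent on the conservative critic Q(s, actor(s));
      2) zeroth-order term: sample perturbed actions around actor(s), keep the one
         with the highest Q value, and regress the actor toward it.
    Assumes actions lie in [-1, 1] and critic(states, actions) returns one value per state.
    """
    actions = actor(states)                              # (B, A)
    pg_loss = -critic(states, actions).mean()            # first-order policy-gradient term

    with torch.no_grad():
        # Zeroth-order search: Gaussian candidates around the actor's action.
        noise = noise_std * torch.randn(num_samples, *actions.shape)
        candidates = (actions.unsqueeze(0) + noise).clamp(-1.0, 1.0)            # (N, B, A)
        q_values = torch.stack([critic(states, c).squeeze(-1) for c in candidates])  # (N, B)
        best = q_values.argmax(dim=0)                                            # (B,)
        best_actions = candidates[best, torch.arange(actions.shape[0])]          # (B, A)

    # Rectification: pull the actor toward the best sampled (detached) action.
    rect_loss = ((actions - best_actions) ** 2).sum(dim=-1).mean()
    return pg_loss + rect_coef * rect_loss
```

In a full training loop, a loss of this form would replace the standard actor update, with the critic trained conservatively on the offline dataset as the abstract describes; the relative weighting of the two terms is a tunable assumption here.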
Author Information
Ling Pan (IIIS, Tsinghua University)
Longbo Huang (Tsinghua University)
Tengyu Ma (Stanford)
Huazhe Xu (Stanford University)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification »
  Thu. Jul 21st through Fri. Jul 22nd, Room Hall E #801
More from the Same Authors
- 2021 : Label Noise SGD Provably Prefers Flat Global Minimizers »
  Alex Damian · Tengyu Ma · Jason Lee
- 2021 : The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition »
  Tiancheng Jin · Longbo Huang · Haipeng Luo
- 2021 : Value-Based Deep Reinforcement Learning Requires Explicit Regularization »
  Aviral Kumar · Rishabh Agarwal · Aaron Courville · Tengyu Ma · George Tucker · Sergey Levine
- 2021 : Learning Barrier Certificates: Towards Safe Reinforcement Learning with Zero Training-time Violations »
  Yuping Luo · Tengyu Ma
- 2022 : Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning »
  Zhecheng Yuan · Zhengrong Xue · Bo Yuan · Xueqian Wang · Yi Wu · Yang Gao · Huazhe Xu
- 2023 Poster: Multi-task Representation Learning for Pure Exploration in Linear Bandits »
  Yihan Du · Longbo Huang · Wen Sun
- 2023 Poster: Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning »
  Jiatai Huang · Yan Dai · Longbo Huang
- 2023 Poster: Better Training of GFlowNets with Local Credit and Incomplete Trajectories »
  Ling Pan · Nikolay Malkin · Dinghuai Zhang · Yoshua Bengio
- 2023 Tutorial: Recent Advances in the Generalization Theory of Neural Networks * »
  Tengyu Ma · Alex Damian
- 2022 Poster: Near-Optimal Algorithms for Autonomous Exploration and Multi-Goal Stochastic Shortest Path »
  Haoyuan Cai · Tengyu Ma · Simon Du
- 2022 Spotlight: Near-Optimal Algorithms for Autonomous Exploration and Multi-Goal Stochastic Shortest Path »
  Haoyuan Cai · Tengyu Ma · Simon Du
- 2022 Poster: Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation »
  Pihe Hu · Yu Chen · Longbo Huang
- 2022 Poster: Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably) »
  Yu Huang · Junyang Lin · Chang Zhou · Hongxia Yang · Longbo Huang
- 2022 Poster: Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning »
  Yunfei Li · Tian Gao · Jiaqi Yang · Huazhe Xu · Yi Wu
- 2022 Spotlight: Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably) »
  Yu Huang · Junyang Lin · Chang Zhou · Hongxia Yang · Longbo Huang
- 2022 Spotlight: Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation »
  Pihe Hu · Yu Chen · Longbo Huang
- 2022 Spotlight: Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning »
  Yunfei Li · Tian Gao · Jiaqi Yang · Huazhe Xu · Yi Wu
- 2022 Poster: Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed Bandits »
  Jiatai Huang · Yan Dai · Longbo Huang
- 2022 Spotlight: Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed Bandits »
  Jiatai Huang · Yan Dai · Longbo Huang
- 2020 Poster: On the Expressivity of Neural Networks for Deep Reinforcement Learning »
  Kefan Dong · Yuping Luo · Tianhe (Kevin) Yu · Chelsea Finn · Tengyu Ma
- 2020 Poster: The Implicit and Explicit Regularization Effects of Dropout »
  Colin Wei · Sham Kakade · Tengyu Ma
- 2020 Poster: Individual Calibration with Randomized Forecasting »
  Shengjia Zhao · Tengyu Ma · Stefano Ermon
- 2020 Poster: Understanding Self-Training for Gradual Domain Adaptation »
  Ananya Kumar · Tengyu Ma · Percy Liang
- 2020 Poster: Combinatorial Pure Exploration for Dueling Bandit »
  Wei Chen · Yihan Du · Longbo Huang · Haoyu Zhao