Timezone: »
The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition
Tiancheng Jin · Longbo Huang · Haipeng Luo
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes, with the goal of achieving $O(\sqrt{T})$ regret when the losses are adversarial and simultaneously $O(\log T)$ regret when the losses are (almost) stochastic. Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known, and leaves the case of unknown transition as a major open question. In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques. Specifically, we first propose a loss-shifting trick in the FTRL analysis, which greatly simplifies the approach of [Jin and Luo, 2020] and already improves their results for the known transition case. Then, we extend this idea to the unknown transition case and develop a novel analysis which upper bounds the transition estimation error by the regret itself in the stochastic setting, a key property to ensure $O(\log T)$ regret.
Author Information
Tiancheng Jin (University of Southern California)
Longbo Huang (Tsinghua University)
Haipeng Luo (University of Southern California)
More from the Same Authors
-
2021 : Online Learning for Stochastic Shortest Path Model via Posterior Sampling »
Mehdi Jafarnia · Liyu Chen · Rahul Jain · Haipeng Luo -
2021 : Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses »
Haipeng Luo · Chen-Yu Wei · Chung-Wei Lee -
2021 : Implicit Finite-Horizon Approximation for Stochastic Shortest Path »
Liyu Chen · Mehdi Jafarnia · Rahul Jain · Haipeng Luo -
2022 : Near-optimal Regret for Adversarial MDP with Delayed Bandit Feedback »
Tiancheng Jin · Tal Lancewicki · Haipeng Luo · Yishay Mansour · Aviv Rosenberg -
2022 : Near-optimal Regret for Adversarial MDP with Delayed Bandit Feedback »
Tiancheng Jin · Tal Lancewicki · Haipeng Luo · Yishay Mansour · Aviv Rosenberg -
2022 Poster: Kernelized Multiplicative Weights for 0/1-Polyhedral Games: Bridging the Gap Between Learning in Extensive-Form and Normal-Form Games »
Gabriele Farina · Chung-Wei Lee · Haipeng Luo · Christian Kroer -
2022 Poster: Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification »
Ling Pan · Longbo Huang · Tengyu Ma · Huazhe Xu -
2022 Spotlight: Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification »
Ling Pan · Longbo Huang · Tengyu Ma · Huazhe Xu -
2022 Spotlight: Kernelized Multiplicative Weights for 0/1-Polyhedral Games: Bridging the Gap Between Learning in Extensive-Form and Normal-Form Games »
Gabriele Farina · Chung-Wei Lee · Haipeng Luo · Christian Kroer -
2022 Poster: No-Regret Learning in Time-Varying Zero-Sum Games »
Mengxiao Zhang · Peng Zhao · Haipeng Luo · Zhi-Hua Zhou -
2022 Poster: Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation »
Pihe Hu · Yu Chen · Longbo Huang -
2022 Poster: Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably) »
Yu Huang · Junyang Lin · Chang Zhou · Hongxia Yang · Longbo Huang -
2022 Spotlight: Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably) »
Yu Huang · Junyang Lin · Chang Zhou · Hongxia Yang · Longbo Huang -
2022 Spotlight: No-Regret Learning in Time-Varying Zero-Sum Games »
Mengxiao Zhang · Peng Zhao · Haipeng Luo · Zhi-Hua Zhou -
2022 Spotlight: Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation »
Pihe Hu · Yu Chen · Longbo Huang -
2022 Poster: Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed Bandits »
Jiatai Huang · Yan Dai · Longbo Huang -
2022 Poster: Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP »
Liyu Chen · Rahul Jain · Haipeng Luo -
2022 Poster: Learning Infinite-horizon Average-reward Markov Decision Process with Constraints »
Liyu Chen · Rahul Jain · Haipeng Luo -
2022 Spotlight: Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed Bandits »
Jiatai Huang · Yan Dai · Longbo Huang -
2022 Spotlight: Learning Infinite-horizon Average-reward Markov Decision Process with Constraints »
Liyu Chen · Rahul Jain · Haipeng Luo -
2022 Oral: Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP »
Liyu Chen · Rahul Jain · Haipeng Luo -
2021 : Implicit Finite-Horizon Approximation for Stochastic Shortest Path »
Liyu Chen · Mehdi Jafarnia · Rahul Jain · Haipeng Luo -
2021 Poster: Achieving Near Instance-Optimality and Minimax-Optimality in Stochastic and Adversarial Linear Bandits Simultaneously »
Chung-Wei Lee · Haipeng Luo · Chen-Yu Wei · Mengxiao Zhang · Xiaojin Zhang -
2021 Poster: Finding the Stochastic Shortest Path with Low Regret: the Adversarial Cost and Unknown Transition Case »
Liyu Chen · Haipeng Luo -
2021 Spotlight: Finding the Stochastic Shortest Path with Low Regret: the Adversarial Cost and Unknown Transition Case »
Liyu Chen · Haipeng Luo -
2021 Spotlight: Achieving Near Instance-Optimality and Minimax-Optimality in Stochastic and Adversarial Linear Bandits Simultaneously »
Chung-Wei Lee · Haipeng Luo · Chen-Yu Wei · Mengxiao Zhang · Xiaojin Zhang -
2020 Poster: Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition »
Chi Jin · Tiancheng Jin · Haipeng Luo · Suvrit Sra · Tiancheng Yu -
2020 Poster: Combinatorial Pure Exploration for Dueling Bandit »
Wei Chen · Yihan Du · Longbo Huang · Haoyu Zhao -
2018 Poster: Practical Contextual Bandits with Regression Oracles »
Dylan Foster · Alekh Agarwal · Miroslav Dudik · Haipeng Luo · Robert Schapire -
2018 Oral: Practical Contextual Bandits with Regression Oracles »
Dylan Foster · Alekh Agarwal · Miroslav Dudik · Haipeng Luo · Robert Schapire