We consider undiscounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Notably, learning drifting MDPs via conventional optimistic exploration presents a unique challenge absent in existing (non-stationary) bandit learning settings. We overcome the challenge with a novel confidence widening technique that incorporates additional optimism.
Author Information
Wang Chi Cheung (National University of Singapore)
David Simchi-Levi (MIT)
Ruihao Zhu (MIT)
More from the Same Authors
- 2023 Poster: Bandits with Knapsacks: Advice on Time-Varying Demands
  Lixing Lyu · Wang Chi Cheung
- 2023 Poster: Pricing Experimental Design: Causal Effect, Expected Revenue and Tail Risk
  David Simchi-Levi · Chonghuan Wang
- 2021 Poster: Near-Optimal Model-Free Reinforcement Learning in Non-Stationary Episodic MDPs
  Weichao Mao · Kaiqing Zhang · Ruihao Zhu · David Simchi-Levi · Tamer Basar
- 2021 Poster: Dynamic Planning and Learning under Recovering Rewards
  David Simchi-Levi · Zeyu Zheng · Feng Zhu
- 2021 Spotlight: Near-Optimal Model-Free Reinforcement Learning in Non-Stationary Episodic MDPs
  Weichao Mao · Kaiqing Zhang · Ruihao Zhu · David Simchi-Levi · Tamer Basar
- 2021 Spotlight: Dynamic Planning and Learning under Recovering Rewards
  David Simchi-Levi · Zeyu Zheng · Feng Zhu
- 2021 Poster: Probabilistic Sequential Shrinking: A Best Arm Identification Algorithm for Stochastic Bandits with Corruptions
  Zixin Zhong · Wang Chi Cheung · Vincent Tan
- 2021 Spotlight: Probabilistic Sequential Shrinking: A Best Arm Identification Algorithm for Stochastic Bandits with Corruptions
  Zixin Zhong · Wang Chi Cheung · Vincent Tan
- 2020 Poster: Best Arm Identification for Cascading Bandits in the Fixed Confidence Setting
  Zixin Zhong · Wang Chi Cheung · Vincent Tan
- 2020 Poster: Online Pricing with Offline Data: Phase Transition and Inverse Square Law
  Jinzhi Bu · David Simchi-Levi · Yunzong Xu