

Spotlight

Dynamic Regret of Online Markov Decision Processes

Peng Zhao · Long-Fei Li · Zhi-Hua Zhou

Hall F

Abstract:

We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We adopt dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible, changing comparator policies. This measure is strictly stronger than the standard static regret, which benchmarks the learner against a single fixed policy. We consider three foundational models of online MDPs: episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For each model, we propose novel online ensemble algorithms and establish their dynamic regret guarantees; the results for episodic (loop-free) SSP are provably minimax optimal in terms of the time horizon and a certain non-stationarity measure.
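To make the performance measure concrete, the dynamic regret described above can be written in a standard form (a sketch under assumed notation: $T$ rounds or episodes, expected loss $f_t(\pi)$ of policy $\pi$ at round $t$, learner's policy $\pi_t$, and an arbitrary comparator sequence $\pi_1^c, \ldots, \pi_T^c$; the paper's exact symbols may differ):

```latex
\mathrm{D\text{-}Regret}_T(\pi_1^c, \ldots, \pi_T^c)
  \;=\; \sum_{t=1}^{T} f_t(\pi_t) \;-\; \sum_{t=1}^{T} f_t(\pi_t^c),
```

where the guarantee is required to hold for any feasible comparator sequence. Static regret is recovered as the special case $\pi_1^c = \cdots = \pi_T^c$, which is why dynamic regret is the strictly stronger measure.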
