Poster

Dynamic Regret of Online Markov Decision Processes

Peng Zhao ⋅ Long-Fei Li ⋅ Zhi-Hua Zhou

Keywords: T: Online Learning and Bandits RL: Online

2022 Poster

[ Poster] [ Paper PDF]

Abstract

We investigate online Markov Decision Processes~(MDPs) with adversarially changing loss functions and known transitions. We choose \emph{dynamic regret} as the performance measure, defined as the performance difference between the learner and any sequence of feasible \emph{changing} policies. The measure is strictly stronger than the standard static regret that benchmarks the learner's performance with a fixed compared policy. We consider three foundational models of online MDPs, including episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For the three models, we propose novel online ensemble algorithms and establish their dynamic regret guarantees respectively, in which the results for episodic (loop-free) SSP are provably minimax optimal in terms of time horizon and certain non-stationarity measure.

Chat is not available.