ICML 2022 Learning Infinite-horizon Average-reward Markov Decision Process with Constraints Spotlight


Liyu Chen · Rahul Jain · Haipeng Luo

Hall F


Abstract: We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We start by designing a policy optimization algorithm with carefully designed action-value estimator and bonus term, and show that for ergodic MDPs, our algorithm ensures $O(\sqrt{T})$ regret and constant constraint violation, where $T$ is the total number of time steps. This strictly improves over the algorithm of (Singh et al., 2020), whose regret and constraint violation are both $O(T^{2/3})$. Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with $O(T^{2/3})$ regret and constraint violation, which can be further improved to $O(\sqrt{T})$ via a simple modification, albeit making the algorithm computationally inefficient. As far as we know, these are the first set of provable algorithms for weakly communicating MDPs with cost constraints.
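
To give a rough sense of the gap between the two rates quoted in the abstract, the sketch below compares $T^{2/3}$ with $\sqrt{T}$ numerically. It is only an illustration of the arithmetic: the helper name `rate_gap` is hypothetical, and constants and any lower-order factors hidden inside the bounds are ignored, so the numbers are not regret values of either algorithm.

```python
from typing import Tuple


def rate_gap(T: int) -> Tuple[float, float, float]:
    """Return (T^(2/3), sqrt(T), ratio), ignoring constants and hidden factors.

    Illustrative helper only; it is not taken from the paper.
    """
    two_thirds = T ** (2.0 / 3.0)
    root = T ** 0.5
    # The ratio equals T^(1/6), so the gap widens slowly as T grows.
    return two_thirds, root, two_thirds / root


if __name__ == "__main__":
    for T in (10**3, 10**6, 10**9):
        two_thirds, root, ratio = rate_gap(T)
        print(f"T={T:>10}  T^(2/3)={two_thirds:12.1f}  "
              f"sqrt(T)={root:10.1f}  ratio={ratio:6.2f}")
```

Since the ratio is $T^{1/6}$, a horizon of $T = 10^6$ steps corresponds to roughly a tenfold difference between the two rates under these simplifying assumptions.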
