A Sharper Global Convergence Analysis for Average Reward Reinforcement Learning via an Actor-Critic Approach
Swetha Ganesh · Washim Mondal · Vaneet Aggarwal
Abstract
This work examines average-reward reinforcement learning with general policy parametrization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and a significant dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs), where $T$ is the horizon length, using an Actor-Critic approach. Moreover, the convergence rate does not scale with the size of the state space, making it applicable even to infinite state spaces.
Lay Summary
This work focuses on training decision-making systems that aim to maximize long-term rewards without any discounting, a problem known as average-reward reinforcement learning. Existing theoretical results are either suboptimal or scale poorly with the size of the state and action spaces. To address these limitations, we introduce a new method called MLMC-NAC, short for Multi-level Monte Carlo-based Natural Actor-Critic. The use of Multi-level Monte Carlo (MLMC) efficiently reduces the bias introduced by Markovian sampling.
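To make the MLMC idea concrete, the following is a minimal Python sketch of the generic multi-level Monte Carlo estimator often used to debias averages computed from a single Markov-chain trajectory. It is illustrative only and not the paper's exact MLMC-NAC update: the callable `sample_fn`, the geometric level distribution, and the truncation `max_level` are assumptions introduced here for exposition.

```python
import numpy as np

def mlmc_estimate(sample_fn, rng, max_level=10):
    """Generic MLMC estimator for the mean of a quantity evaluated along
    one Markov-chain trajectory (illustrative sketch, not MLMC-NAC itself).

    sample_fn(n) -- hypothetical callable returning an array of n
                    consecutive samples (e.g., TD errors or gradient terms)
                    from a single trajectory of the Markov chain.
    """
    # Draw a random level J ~ Geometric(1/2), so P(J = j) = 2^{-j}.
    level = int(rng.geometric(p=0.5))

    if level > max_level:
        # Level too deep: fall back to the single-sample base estimate.
        return np.asarray(sample_fn(1))[0]

    # Collect 2^J consecutive samples from one trajectory.
    samples = np.asarray(sample_fn(2 ** level))

    base = samples[0]                                   # mean over 2^0 samples
    coarse = samples[: 2 ** (level - 1)].mean(axis=0)   # mean over 2^{J-1} samples
    fine = samples.mean(axis=0)                         # mean over 2^J samples

    # Importance-weighted telescoping correction: in expectation this matches
    # an average over 2^{max_level} samples, whose Markovian-sampling bias
    # decays with the chain's mixing time, while the expected number of
    # samples drawn per call stays only O(max_level).
    return base + (2 ** level) * (fine - coarse)
```

Under these assumptions, a critic or gradient update would call `mlmc_estimate` once per iteration, trading a small, logarithmic increase in expected sample cost for a bias comparable to averaging over a much longer trajectory.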