Poster

Global Optimality without Mixing Time Oracles in Average-reward RL via Multi-level Actor-Critic

Bhrij Patel · Wesley A. Suttle · Alec Koppel · Vaneet Aggarwal · Brian Sadler · Amrit Bedi · Dinesh Manocha


Abstract: In average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time (a measure of how long a Markov chain under a fixed policy needs to reach its stationary distribution) poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic because estimating the mixing time is difficult and expensive in environments with large state spaces, forcing impractically long trajectories for effective gradient estimation in practice. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. Our approach removes the dependence on mixing time knowledge, a first for global convergence guarantees in average-reward MDPs. Furthermore, our approach achieves the tightest available mixing-time dependence of $O(\sqrt{\tau_{mix}})$ relative to prior work. In a 2D gridworld goal-reaching navigation experiment, we demonstrate that MAC achieves higher reward than a previous policy-gradient-based method for the average-reward setting, Parameterized Policy Gradient with Advantage Estimation (PPGAE), especially when a relatively small training sample budget restricts trajectory length.
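The core idea of an MLMC gradient estimator is to draw a random truncation level and combine gradient estimates computed at geometrically growing sample sizes, so that in expectation the estimator matches a high-accuracy estimate while its average cost stays low. The sketch below is a generic illustration of this construction, not the paper's exact algorithm; the `grad_fn` interface (returning an average of `n` single-sample gradient estimates) and the parameter names are assumptions for illustration.

```python
import numpy as np

def mlmc_gradient(grad_fn, rng, max_level=10):
    """Generic Multi-level Monte-Carlo gradient estimator (illustrative sketch).

    grad_fn(n) is assumed to return the average of n single-sample gradient
    estimates; in an actor-critic setting these would come from rollouts.
    """
    # Draw a random level J with P(J = j) = 2^{-j}, j = 1, 2, ...
    J = int(rng.geometric(0.5))
    # Base estimate from a single sample.
    g = grad_fn(1)
    if J <= max_level:
        # Telescoping correction: reweight the difference between estimates
        # at 2^J and 2^(J-1) samples by the inverse level probability 2^J,
        # which makes the estimator unbiased for grad_fn(2^max_level)
        # in expectation while keeping expected sample cost O(max_level).
        g = g + (2 ** J) * (grad_fn(2 ** J) - grad_fn(2 ** (J - 1)))
    return g
```

Because the level probability exactly cancels the reweighting factor, the expected number of samples drawn per call grows only linearly in `max_level`, rather than as `2^max_level`, which is what removes the need to know the mixing time when setting trajectory lengths.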
