Revisiting Regularized Policy Optimization for Stable and Efficient Reinforcement Learning in Two-Player Games
Abstract
Two-player games such as board games have long served as traditional benchmarks for reinforcement learning. This work revisits regularized policy optimization with reverse Kullback-Leibler (KL) divergence and entropy regularization, and analyzes this combination in two-player zero-sum settings from both theoretical and empirical perspectives. On the theoretical side, we investigate the stability of the policy update rule in two settings: game-theoretic normal-form games and finite-length games. We provide convergence guarantees and verify our theoretical results with numerical experiments on synthetic games. On the empirical side, we derive a practical model-free reinforcement learning algorithm based on this regularized policy optimization. We validate the training efficiency of our algorithm through comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. The experimental results demonstrate that our agent learns more efficiently than existing methods across these environments.
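For concreteness, a minimal sketch of a KL- and entropy-regularized policy update of the kind referred to above is shown below. The coefficients $\alpha$ and $\tau$, the reference policy $\pi_k$, and the action-value function $Q^{\pi_k}$ are placeholder symbols for illustration only and are not taken from the paper's notation or exact objective.

\[
\pi_{k+1}(\cdot \mid s) \;=\; \arg\max_{\pi} \;\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}(s, a) \right]
\;-\; \alpha\, \mathrm{KL}\!\left( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \right)
\;+\; \tau\, \mathcal{H}\!\left( \pi(\cdot \mid s) \right),
\]

where the reverse KL term penalizes deviation of the new policy from the previous one (promoting stable updates) and the entropy term $\mathcal{H}$ encourages exploration.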