Doubly Regularized Markov Decision Processes for Robust Reinforcement Learning
Abstract
Empirical successes show that regularization improves the stability and efficiency of reinforcement learning (RL), with applications in robotics and post-training of large language models. Yet, theoretical analyses of regularized Markov decision processes (MDPs) have mostly been confined to the standard RL setting. In this work, we investigate regularized MDPs through the lens of robust RL. We introduce a doubly regularized MDP framework that combines policy and dynamics regularization to enable robust policy learning against reward and dynamics perturbations. Within this framework, we develop an optimism-based online algorithm and provide the first finite-sample regret guarantees in both tabular and rich-observation settings, where the state-action space may be continuous. Our results show that algorithms for doubly regularized MDPs are as sample-efficient as well-studied robust MDP algorithms, while additionally benefiting from the flexibility of soft policies. Finally, experiments demonstrate that our approach efficiently and effectively handles function approximation and exploration in large state-action spaces, achieving robust performance.