Towards Achieving Optimal Strong Regret and Constraint Violation via Computationally Efficient Model-Free RL
Xiyue Peng ⋅ Lingkai Zu ⋅ Ziyu Shao ⋅ Xin Liu
Abstract
We study episodic constrained Markov decision processes (CMDPs) with linear function approximation, where the goal is to achieve strong regret and strong constraint violation guarantees, i.e., guarantees that do not allow errors to cancel across episodes. Unlike existing work, which focuses on either tabular CMDPs or model-based reinforcement learning methods, we propose a model-free policy optimization algorithm, APMPO, that achieves near-optimal $\widetilde{O}(\sqrt{K})$ strong regret and strong constraint violation under Slater's condition (i.e., a strict feasibility assumption), where $K$ is the total number of episodes. This matches the best-known rates reported in prior model-based work for tabular CMDPs, without requiring any prior knowledge of the feasibility gap. Moreover, APMPO achieves $\widetilde{O}(K^{\frac{3}{4}})$ strong regret and $\widetilde{O}(K^{\frac{3}{4}})$ strong constraint violation without Slater's condition. To the best of our knowledge, this is the first sublinear result for CMDPs with respect to the strong metrics without Slater's condition. APMPO attains these results through a novel adaptive design of a violation-aware penalty and learning rates that balances strong regret against strong constraint violation, which differs markedly from the (regularized) primal-dual methods in the literature that impose constraints via dual penalties. Experiments show that APMPO significantly outperforms strong baselines, which justifies our design and corroborates the theoretical guarantees.
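For concreteness, the strong metrics referenced above are commonly formalized in the CMDP literature as follows; this is a sketch of the standard definitions, and the paper's exact notation may differ. Writing $V^{\pi}_r(x_1)$ and $V^{\pi}_g(x_1)$ for the reward and utility value functions of policy $\pi$ at the initial state, $\pi^*$ for the optimal feasible policy, $b$ for the constraint threshold, and $\pi_k$ for the policy deployed in episode $k$:

$$
\mathrm{Reg}^{+}(K) \;=\; \sum_{k=1}^{K} \Big[\, V^{\pi^*}_r(x_1) - V^{\pi_k}_r(x_1) \,\Big]^{+},
\qquad
\mathrm{Vio}^{+}(K) \;=\; \sum_{k=1}^{K} \Big[\, b - V^{\pi_k}_g(x_1) \,\Big]^{+},
$$

where $[x]^{+} = \max\{x, 0\}$. The per-episode positive part is what rules out cancellation: over-satisfying the constraint (or over-earning reward) in one episode cannot offset a violation (or reward shortfall) in another, in contrast to the weaker metrics $\sum_{k}\big(V^{\pi^*}_r(x_1) - V^{\pi_k}_r(x_1)\big)$ and $\big[\sum_{k}\big(b - V^{\pi_k}_g(x_1)\big)\big]^{+}$.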