Why Dedicated Critics: Eliminating Target Drift in Multi-Constraint RL
Abstract
Lagrangian-based methods are a fundamental paradigm in safe reinforcement learning (RL) for constrained Markov decision processes, particularly in the multi-constraint setting. Existing methods differ in their critic architecture: some use a single estimator for the mixed penalty term aggregating all constraints, while others maintain separate estimators per constraint. The theoretical validity of these architectures, however, has remained largely unexplored. This paper provides the first theoretical analysis of both structures and proves that the mixed critic structure incurs a bias caused by target drift of the Lagrange multipliers: as the multipliers are updated, the regression target of the mixed critic shifts. In contrast, the dedicated critic structure, which learns separate critics for the reward function and for each constraint function, is free of this bias. The analysis is supported by experiments on a realistic power system environment with multiple constraints, in which the dedicated critic structure satisfies the constraints while the mixed critic structure fails.
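To make the distinction between the two critic structures concrete, the following is a sketch of a standard multi-constraint Lagrangian formulation; the notation ($J_R$, $J_{C_i}$, $d_i$, $\lambda_i$, $m$) is assumed here for illustration and is not taken from the paper itself.

```latex
% Constrained objective: maximize the reward return subject to m cost limits
\max_{\pi} \; J_R(\pi)
  \quad \text{s.t.} \quad J_{C_i}(\pi) \le d_i, \qquad i = 1, \dots, m

% Lagrangian relaxation with multipliers \lambda_i \ge 0
\mathcal{L}(\pi, \lambda)
  = J_R(\pi) - \sum_{i=1}^{m} \lambda_i \bigl( J_{C_i}(\pi) - d_i \bigr)

% Mixed critic: a single estimator regresses toward a \lambda-dependent target
Q_{\mathrm{mix}}(s, a) \;\approx\; Q_R(s, a) - \sum_{i=1}^{m} \lambda_i \, Q_{C_i}(s, a)

% Dedicated critics: Q_R and each Q_{C_i} have targets independent of \lambda
```

Under this formulation, each update of the multipliers $\lambda_i$ moves the regression target of $Q_{\mathrm{mix}}$, whereas the targets of $Q_R$ and each $Q_{C_i}$ are unaffected by the multipliers.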