Constrained Meta Reinforcement Learning with Provable Test-Time Safety
Abstract
Meta reinforcement learning (RL) allows an agent to leverage experience gathered across a distribution of training tasks, on which it can train at will, to learn optimal policies faster on new test tasks. Despite meta RL's success in reducing sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. The key challenge is to learn optimal policies while ensuring safe exploration, i.e., every policy executed during testing must remain feasible. The sample complexity of constrained meta RL with provable safe-exploration guarantees remains largely unexplored. To address this gap, we propose an algorithm that refines the policies learned during training, with provable safe exploration and a sample complexity guarantee for learning a near-optimal policy. We further derive a matching lower bound, showing that this sample complexity is tight. We validate our approach in a gridworld environment, where it outperforms prior constrained RL and constrained meta RL methods in learning efficiency while ensuring safe exploration.