Learning Reward–Cost Balance in Safe RL via Score-Based World Models
Abstract
Safe reinforcement learning (Safe RL) seeks to optimize long-term performance while ensuring adherence to safety constraints. However, most existing approaches handle safety in a simplified manner, typically by linearly combining rewards and costs, which provides limited guidance when safety and performance interact in complex, nonlinear ways. We present USB-RL (Unsupervised Score-Balanced Reinforcement Learning), a model-based framework that learns implicit safety–performance preferences directly from experience. Our approach infers a scalar score, monotone with respect to the partial order over reward–cost outcomes, through unsupervised pairwise comparisons of long-horizon outcomes, capturing nuanced trade-offs without relying on manually tuned cost weights. The learned score guides model-based policy optimization by dynamically balancing safety and performance, enabling adaptive multi-step planning in imagination-based control. Across diverse safety benchmarks, USB-RL achieves strong returns while substantially reducing safety violations, demonstrating stable and interpretable safety–performance trade-offs.
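The abstract does not spell out the comparison objective, so the following is a minimal sketch of one plausible instantiation: preference labels derived automatically from Pareto dominance between imagined (return, cost) outcome pairs, fit with a Bradley–Terry pairwise logistic loss. The names `ScoreNet`, `dominance_pairs`, and `pairwise_score_loss`, and the choice of a two-dimensional outcome summary, are illustrative assumptions, not the paper's actual interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScoreNet(nn.Module):
    """Small MLP mapping an outcome summary (return, cost) to a scalar score."""

    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)


def dominance_pairs(ret, cost):
    """Index pairs (i, j) where outcome i Pareto-dominates outcome j:
    ret[i] >= ret[j] and cost[i] <= cost[j], with at least one strict
    inequality. Incomparable pairs contribute no training signal, which
    is one way to keep the learned score monotone w.r.t. the partial order."""
    r_ge = ret.unsqueeze(1) >= ret.unsqueeze(0)
    c_le = cost.unsqueeze(1) <= cost.unsqueeze(0)
    strict = (ret.unsqueeze(1) > ret.unsqueeze(0)) | (cost.unsqueeze(1) < cost.unsqueeze(0))
    return (r_ge & c_le & strict).nonzero(as_tuple=False)  # rows: (winner, loser)


def pairwise_score_loss(score_net, ret, cost):
    """Bradley–Terry loss: push score(winner) above score(loser)."""
    pairs = dominance_pairs(ret, cost)
    if pairs.numel() == 0:
        return torch.zeros((), requires_grad=True)
    s = score_net(torch.stack([ret, cost], dim=-1))
    return -F.logsigmoid(s[pairs[:, 0]] - s[pairs[:, 1]]).mean()


# Toy usage: in the framework described above, outcomes would come from
# imagined rollouts of the world model rather than random placeholders.
net = ScoreNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
ret = torch.randn(128) * 10 + 50   # placeholder long-horizon returns
cost = torch.rand(128) * 5         # placeholder accumulated costs
for _ in range(200):
    opt.zero_grad()
    loss = pairwise_score_loss(net, ret, cost)
    loss.backward()
    opt.step()
```

Under this reading, the learned score needs no hand-tuned cost weight: any pair of imagined trajectories where one strictly dominates the other supplies a free preference label, and the score interpolates over incomparable outcomes.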