Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning
Abstract
Multi-objective reinforcement learning (MORL) seeks policies that effectively balance conflicting objectives. However, presenting many diverse policies without accounting for the decision maker's (DM's) preferences can overwhelm the decision-making process, while accurately specifying preferences in advance is often unrealistic. To address these challenges, we introduce a human-in-the-loop MORL framework that interactively discovers preferred policies during optimization. Our approach proactively learns the DM's implicit preferences in real time, requiring no a priori knowledge. Importantly, we integrate this preference learning directly into a parallel optimization framework, balancing exploration and exploitation to identify high-quality policies aligned with the DM's preferences. Evaluations on a complex quadrupedal robot simulation environment demonstrate that, with only a limited number of interactions, our proposed method can identify policies aligned with human preferences, e.g., running like a dog. Further experiments on seven MuJoCo tasks and a multi-microgrid system design task against eight state-of-the-art MORL algorithms demonstrate the effectiveness of our proposed framework. Demonstrations and full experimental results are available at https://sites.google.com/view/pbmorl/home.