Online Compatible Reward Identification from Preference Feedback
Abstract
In reinforcement learning, human preference feedback is emerging as a viable alternative to expert-designed reward functions, which can be difficult to engineer in real-world problems. However, despite the growing importance of preference feedback, how to elicit preferences effectively remains a fundamental open problem. This work focuses on the compatible reward identification task, whose aim is to derive, from preference feedback, a reward function that is compatible with the observed preferences and accurate across the entire state-action space, thereby ensuring higher transferability, safety, and interpretability. By contrast, the most common objective in reinforcement learning from human feedback is to learn the optimal policy, which requires accuracy only in the portion of the state-action space that the agent visits and, consequently, cannot provide the same guarantees as compatible reward identification. First, we discuss the commonalities and differences between the two goals. Then, we consider deterministic preferences, deriving the minimum number of interactions needed to identify the set of compatible rewards and showing that using fewer queries may lead to arbitrarily large suboptimality. Finally, we focus on stochastic preferences generated via the Bradley-Terry (BT) model. We introduce the notions of a query basis and its index, relating them to the complexity of the problem. Building on this, we discuss the connection between the index of a basis and the BT model, as well as the limitations that the model induces in this setting. Additionally, we devise an algorithm that identifies a nearly-optimal query basis with polynomial human query complexity.
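For reference, the Bradley-Terry model mentioned above is commonly instantiated as follows; this is the standard textbook formulation rather than notation taken from this work, with r denoting the reward function and \tau_1, \tau_2 the two compared trajectory segments (symbols introduced here only for illustration):

% Standard Bradley-Terry preference probability over trajectory segments
% (illustrative notation, not the paper's own):
% P(\tau_1 \succ \tau_2) = \frac{\exp(R(\tau_1))}{\exp(R(\tau_1)) + \exp(R(\tau_2))},
% \qquad \text{where } R(\tau) = \sum_{(s,a) \in \tau} r(s,a).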