

Poster in Workshop: The Many Facets of Preference-Based Learning

Exploiting Action Distances for Reward Learning from Human Preferences

Mudit Verma · Siddhant Bhambri · Subbarao Kambhampati


Abstract:

Preference-based Reinforcement Learning (PbRL) with binary preference feedback over trajectory pairs has proved to be quite effective in learning complex preferences of a human in the loop in domains with high-dimensional state and action spaces. While the human preference is primarily inferred from the feedback provided, we propose that, in situations where the human preferences are goal-oriented, the policy being learned (jointly with the reward model) during training can also provide a valuable learning signal about the probable goal underlying the human preference. To utilize this information, we introduce an action distance measure based on the policy and use it as an auxiliary prediction task for reward learning. This measure not only provides insight into the transition dynamics of the environment but also informs about the reachability of states under the policy by giving a distance-to-goal measure. We choose six tasks with goal-oriented preferences in the Meta-World domains to evaluate the performance and sample efficiency of our approach. We show that our approach outperforms PbRL baselines augmented with auxiliary tasks that learn environment dynamics or a non-temporal distance measure. Additionally, we show that the action distance measure can also accelerate policy learning, which is reaffirmed by our experimental results.
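The abstract describes attaching an auxiliary action-distance prediction task to the preference-based reward model. Below is a minimal, hedged sketch of how such a setup could look; the class and function names (RewardWithActionDistance, training_loss), the network sizes, the loss weighting, and the way distance targets are obtained are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a preference-based reward model with an auxiliary
# action-distance head. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardWithActionDistance(nn.Module):
    def __init__(self, obs_dim, hidden_dim=256):
        super().__init__()
        # Shared trunk over individual states (or state-action pairs).
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Per-state reward; summed over a segment for preference comparisons.
        self.reward_head = nn.Linear(hidden_dim, 1)
        # Auxiliary head: predicts a policy-dependent action distance
        # between a pair of states.
        self.distance_head = nn.Linear(2 * hidden_dim, 1)

    def segment_return(self, segment):
        # segment: (batch, T, obs_dim) -> predicted return, shape (batch,)
        feats = self.trunk(segment)
        return self.reward_head(feats).squeeze(-1).sum(dim=-1)

    def action_distance(self, s_i, s_j):
        # Predicted number of policy steps separating two states, shape (batch,)
        h = torch.cat([self.trunk(s_i), self.trunk(s_j)], dim=-1)
        return self.distance_head(h).squeeze(-1)


def training_loss(model, seg_a, seg_b, prefs, s_i, s_j, dist_targets,
                  aux_weight=0.1):
    """Bradley-Terry preference loss plus auxiliary action-distance regression.

    prefs: (batch,) in {0, 1}; 1 means seg_a is preferred over seg_b.
    dist_targets: (batch,) action distances estimated under the current policy
    (e.g. step counts between states in rollouts); how these targets are
    computed is an assumption in this sketch.
    """
    logits = model.segment_return(seg_a) - model.segment_return(seg_b)
    pref_loss = F.binary_cross_entropy_with_logits(logits, prefs.float())
    aux_loss = F.mse_loss(model.action_distance(s_i, s_j), dist_targets)
    return pref_loss + aux_weight * aux_loss
```

The intent of the shared trunk is that the auxiliary distance-regression gradients shape the same state representation used for reward prediction, which is one plausible reading of how an action-distance auxiliary task could aid reward learning; the actual architecture and loss combination in the paper may differ.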
