Poster in Workshop: Models of Human Feedback for AI Alignment
Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries
Xuening Feng · Zhaohui Jiang · Timo Kaufmann · Eyke Hüllermeier · Paul Weng · Yifei Zhu
Learning human objectives from preference feedback has significantly advanced reinforcement learning (RL) in domains with hard-to-formalize objectives. Traditional methods based on pairwise trajectory comparisons face two challenges: trajectories with subtle differences are hard to compare, and comparisons are only ordinal, which limits direct inference of preference strength. In this paper, we introduce the distinguishability query, in which humans compare two pairs of trajectories, indicate which pair is easier to compare, and then give preference feedback on the easier pair. This type of query allows preference strength to be inferred directly and is expected to reduce the labeler's cognitive load. We also connect this query to cardinal utility and difference relations, and develop an efficient query selection scheme that achieves a better trade-off between query informativeness and easiness. Experimental results empirically demonstrate the potential of our method for faster, data-efficient learning and improved user-friendliness on RLHF benchmarks.
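To make the idea concrete, here is a minimal sketch of how a distinguishability query could be modeled. It is not the paper's actual method: the `utility` function, the Bradley-Terry preference model, and the assumption that "easier to compare" corresponds to a larger utility gap are all illustrative choices made for this example.

```python
import math

def utility(trajectory):
    # Hypothetical stand-in for a learned utility: score a trajectory
    # by its summed per-step rewards.
    return sum(trajectory)

def preference_prob(a, b):
    # Bradley-Terry model: probability that trajectory a is preferred
    # over trajectory b, as a logistic function of the utility gap.
    return 1.0 / (1.0 + math.exp(-(utility(a) - utility(b))))

def distinguishability_prob(pair1, pair2):
    # Assumed model of a distinguishability query: the pair with the
    # larger absolute utility gap is more likely to be judged "easier
    # to compare". This is where cardinal (strength) information enters,
    # which an ordinal pairwise comparison alone cannot provide.
    gap1 = abs(utility(pair1[0]) - utility(pair1[1]))
    gap2 = abs(utility(pair2[0]) - utility(pair2[1]))
    return 1.0 / (1.0 + math.exp(-(gap1 - gap2)))

# Toy example: pair1 has a large utility gap, pair2 a small one,
# so under this model pair1 should be judged easier to compare.
pair1 = ([1.0, 1.0], [0.0, 0.0])   # utilities 2.0 vs 0.0 -> gap 2.0
pair2 = ([0.6, 0.5], [0.5, 0.5])   # utilities 1.1 vs 1.0 -> gap 0.1
p_easier = distinguishability_prob(pair1, pair2)
```

Under this toy model, answers to distinguishability queries constrain *differences* of utility gaps, which is one way to see the connection to cardinal utility and difference relations mentioned in the abstract.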