Poster in Workshop: Models of Human Feedback for AI Alignment
Scalable Oversight by Accounting for Unreliable Feedback
Shivam Singhal · Cassidy Laidlaw · Anca Dragan
Fri 26 Jul midnight PDT — 8 a.m. PDT
Reward functions learned from human feedback serve as the training objective for RLHF, the current state-of-the-art approach for aligning large language models to our values; in practice, however, these reward models fail to robustly capture our desiderata. For instance, they often place more weight on output length or agreement with the user than on important features like factual correctness. A major reason for these shortcomings is that the human annotator feedback on which the models are trained is unreliable. Due to knowledge gaps, limited resources, cognitive biases, or other factors, annotators may not be able to accurately judge the model's outputs, and thus their feedback may not reliably reflect their true preferences. Current proposals to address the challenges posed by unreliable feedback include restricting annotators to questions they can answer reliably, providing them with an AI assistant during evaluation, and relying primarily on AI feedback with limited human supervision (e.g., constitutional AI). However, it remains unclear how practical and scalable these approaches are. We identify a complementary strategy that can easily be incorporated into existing alignment methods (e.g., RLHF, DPO): explicitly modeling the annotators' knowledge and judgment in order to better learn from unreliable feedback. In particular, we propose an adjustment to the Bradley-Terry model used in preference learning that accounts for how well an annotator's feedback is expected to match their true values or preferences. We test our approach in a setting where annotators are likely to provide unreliable feedback, and we find that it yields preference models that place more value on important characteristics, such as factuality, than those learned with existing methods.
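To make the idea of a reliability-aware Bradley-Terry likelihood concrete, here is a minimal sketch in Python. It is an illustration only, assuming a simple hypothetical form of the adjustment in which an annotator reports their true preference with some probability (a per-annotator `reliability` parameter) and otherwise answers at random; the abstract does not specify the paper's exact parameterization.

```python
import numpy as np

def bradley_terry_prob(reward_a, reward_b):
    """Standard Bradley-Terry model: probability that output A is preferred
    over output B if the annotator's feedback perfectly reflects their
    true preferences."""
    return 1.0 / (1.0 + np.exp(-(reward_a - reward_b)))

def reliability_adjusted_prob(reward_a, reward_b, reliability):
    """Hypothetical reliability-adjusted likelihood: with probability
    `reliability` the annotator reports their true preference; otherwise
    they choose between A and B uniformly at random."""
    p_true = bradley_terry_prob(reward_a, reward_b)
    return reliability * p_true + (1.0 - reliability) * 0.5

def negative_log_likelihood(rewards_a, rewards_b, labels, reliabilities):
    """Loss for fitting reward estimates from pairwise labels that may be
    unreliable. labels[i] = 1 if annotator i preferred A, 0 otherwise."""
    p_a = reliability_adjusted_prob(rewards_a, rewards_b, reliabilities)
    p_choice = np.where(labels == 1, p_a, 1.0 - p_a)
    return -np.mean(np.log(p_choice + 1e-12))

# Example: a low-reliability annotator's label moves the likelihood less,
# so their (possibly mistaken) preference has less influence on the
# learned rewards than a high-reliability annotator's.
rewards_a = np.array([1.0, 1.0])
rewards_b = np.array([0.0, 0.0])
labels = np.array([0, 0])          # both annotators preferred B
reliabilities = np.array([0.95, 0.55])
print(negative_log_likelihood(rewards_a, rewards_b, labels, reliabilities))
```

Under this kind of adjustment, feedback from annotators who are unlikely to judge an output correctly (e.g., due to knowledge gaps) is effectively down-weighted, so the learned reward model is pulled less toward unreliable labels.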