Uncertainty-aware Preference Alignment in Reinforcement Learning from Human Feedback

Sheng Xu ⋅ Bo Yue ⋅ Hongyuan Zha ⋅ Guiliang Liu

2024 Poster in Workshop: Models of Human Feedback for AI Alignment

Abstract

Recent approaches to Reinforcement Learning from Human Feedback (RLHF) typically learn a reward function by maximizing the likelihood of the observed human preferences. However, because individuals come from diverse backgrounds, these preference signals are inherently stochastic. This uncertainty can lead to unstable or unsafe behavior during reward and policy updates. In this work, we introduce uncertainty-aware preference alignment in RLHF by learning a distributional reward model and a risk-sensitive policy from an offline preference dataset. Specifically, we propose a Maximum A Posteriori (MAP) objective for updating the reward associated with a trajectory. This update incorporates an informative prior to account for the uncertainty in human preferences. Using the updated reward samples, we develop a generative reward model to represent the reward distribution. Motivated by the inherent stochasticity of the reward model, we use the offline distributional Bellman operator and the Conditional Value-at-Risk (CVaR) metric to learn a risk-sensitive policy from the offline dataset. Experimental results show that the risk-sensitive RLHF agent can effectively identify and avoid states with significant stochasticity, thereby enabling risk-averse control across different tasks.
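
To illustrate the CVaR metric mentioned in the abstract, the following is a minimal sketch of an empirical lower-tail CVaR and of how it can change which action is preferred. It is not the authors' implementation: the function names `cvar` and `risk_averse_action`, the risk level `alpha`, and the two toy return distributions are all hypothetical and chosen only for illustration.

```python
import numpy as np


def cvar(samples, alpha=0.1):
    """Empirical lower-tail CVaR: mean of the worst alpha-fraction of samples.

    Assumption: higher values are better, so "worst" means smallest.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    # Keep at least one sample so the tail is never empty.
    k = max(1, int(np.ceil(alpha * x.size)))
    return float(x[:k].mean())


def risk_averse_action(return_samples, alpha=0.1):
    """Pick the index of the action whose sampled returns have the highest CVaR."""
    scores = [cvar(s, alpha) for s in return_samples]
    return int(np.argmax(scores))


# Hypothetical sampled returns for two actions.
safe = np.full(100, 1.0)                                       # always returns 1.0
risky = np.concatenate([np.full(95, 2.0), np.full(5, -10.0)])  # higher mean, rare large loss

mean_choice = int(np.argmax([safe.mean(), risky.mean()]))  # risk-neutral choice
cvar_choice = risk_averse_action([safe, risky], alpha=0.1)  # risk-averse choice
print(mean_choice, cvar_choice)
```

In this toy setting the risky action has the higher mean return, so a risk-neutral criterion selects it, while the CVaR criterion selects the safe action because it scores each action only by its worst outcomes. With `alpha=1.0` the CVaR reduces to the ordinary mean.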
