

Oral in Workshop: The Many Facets of Preference-Based Learning

Learning Optimal Advantage from Preferences and Mistaking it for Reward

William Knox · Stephane Hatgis-Kessell · Sigurdur Adalgeirsson · Serena Booth · Anca Dragan · Peter Stone · Scott Niekum


Abstract:

Most recent work on learning reward functions from human preferences over pairs of trajectory segments---as used in reinforcement learning from human feedback (RLHF), including for ChatGPT and many contemporary language models---is built on the assumption that such human preferences are generated based only upon the reward accrued within those segments, which we call their partial return. But if this assumption is false because people base their preferences on information other than partial return, then what type of function are these algorithms learning from preferences? We argue that this function is better thought of as an approximation of the optimal advantage function, not a reward function as previously believed.
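To make the contrast concrete, below is a minimal sketch of the two preference models at issue, written under the common Bradley-Terry / Boltzmann formulation; the notation (segments \sigma^1, \sigma^2, states s_t, actions a_t) is ours and is an assumption, not taken verbatim from the paper. The first expression is the partial-return model assumed by most RLHF reward learning, in which a segment's preference probability depends only on its summed reward:

P(\sigma^1 \succ \sigma^2) \;=\; \frac{\exp\!\big(\sum_t r(s^1_t, a^1_t)\big)}{\exp\!\big(\sum_t r(s^1_t, a^1_t)\big) + \exp\!\big(\sum_t r(s^2_t, a^2_t)\big)}

The second expression is an alternative in which preferences are instead generated from the optimal advantage function, A^*(s,a) = Q^*(s,a) - V^*(s), summed over each segment:

P(\sigma^1 \succ \sigma^2) \;=\; \frac{\exp\!\big(\sum_t A^*(s^1_t, a^1_t)\big)}{\exp\!\big(\sum_t A^*(s^1_t, a^1_t)\big) + \exp\!\big(\sum_t A^*(s^2_t, a^2_t)\big)}

If human preferences are closer to the second model, then the function recovered by fitting the first model's likelihood is better interpreted as an approximation of A^* rather than as a reward function, which is the abstract's central claim.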
