Regularization in the Axiomatic Approach to Learning from Human Preferences
Abstract
Reinforcement learning from human feedback (RLHF) is the leading approach to aligning powerful AI systems so that they are safe and helpful for humanity. While RLHF is typically modeled as the problem of learning a single preference ranking from noisy feedback, true human preferences are complex and often conflicting, reflecting substantive disagreements that stem from the diversity of individual human values. With this motivation, a recent line of research has studied RLHF from the perspective of social choice theory, which provides a set of well-established desirable properties (axioms) for aggregating diverse preferences. Seen through this lens, the standard RLHF learning objective is equivalent to aggregating diverse human preferences via the Borda count rule. At the same time, several recently proposed RLHF algorithms turn out to be equivalent to the von Neumann winner social choice rule. However, the connection between social choice theory and RLHF has thus far ignored the critical role of regularization toward a reference policy, which essentially all practical RLHF algorithms use to prevent the learned policy from diverging too far from the reference. In this paper, we study how regularization affects the social choice axioms satisfied by different RLHF algorithms, and we prove that regularization improves the axiomatic properties of the von Neumann winner rule. In contrast, the Borda count rule fails to satisfy key social choice axioms even when regularized. These results provide a principled argument, grounded in social choice theory, for using practical RLHF algorithms that correspond to the von Neumann winner rather than the standard RLHF objective.