The Sign Estimator: Preference Modeling for LLM Alignment under Heterogeneity
Aymane El Gadarri ⋅ Ali Aouad ⋅ Vivek Farias
Abstract
Traditional large language model (LLM) alignment methods rely on Reinforcement Learning from Human Feedback (RLHF), which learns a single reward model (implicitly or explicitly) from pairwise comparison data. This approach implicitly assumes homogeneous preferences across human labelers, an assumption that is violated in practice. As a result, the learned reward model in RLHF is generally misspecified: prior work shows that it is inconsistent with the population-average utility, incurring large distortion, and that recovering this utilitarian objective is provably impossible in the worst case. In this work, we show that the average utility is recoverable under a mild assumption. Our method, the Sign Estimator, simply replaces the standard cross-entropy loss with a binary classification loss and yields a reward model that is ordinally consistent with the population-average utility. We further establish a fast finite-sample convergence rate of $O(n^{-1/3})$, which provides, to our knowledge, the first consistent estimator for heterogeneous preferences that does not suffer from the curse of dimensionality.
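The abstract describes the loss substitution only at a high level. As a minimal illustrative sketch (not the paper's implementation), the snippet below contrasts the standard cross-entropy (Bradley-Terry) objective with a binary classification loss on the sign of the reward gap, assuming a linear reward model $r(x) = \theta^\top x$; all names and the data-generating setup are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise comparison data: feature vectors for responses a and b,
# and labels y = 1 if a was preferred over b, 0 otherwise.
d, n = 5, 1000
X_a, X_b = rng.normal(size=(n, d)), rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = ((X_a - X_b) @ theta_true + rng.logistic(size=n) > 0).astype(int)

def cross_entropy_loss(theta, X_a, X_b, y):
    """Standard RLHF objective: logistic (Bradley-Terry) loss on reward gaps."""
    z = (X_a - X_b) @ theta                  # r(a) - r(b) under a linear reward
    signed = np.where(y == 1, z, -z)         # margin of the observed label
    return np.mean(np.log1p(np.exp(-signed)))

def sign_loss(theta, X_a, X_b, y):
    """Binary classification (0-1) loss on the sign of the reward gap."""
    z = (X_a - X_b) @ theta
    return np.mean((z > 0).astype(int) != y)
```

In practice the 0-1 loss is non-differentiable and would be minimized via a smoothed or convex surrogate; this toy snippet only illustrates the two objectives and does not reproduce the paper's estimator or its consistency guarantees.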