Graph-Preference Learning: Debiasing Network-Sampled Human Feedback for Target Welfare Estimation
Guangrui Fan ⋅ DanDan Liu ⋅ AZNUL SABRI ⋅ Pan Lihu
Abstract
Preference-based reward modeling is a core component of RLHF and DPO pipelines. In practice, the humans providing preference feedback are rarely an i.i.d. sample: recruitment and exposure often follow social, institutional, or spatial structure, inducing non-uniform inclusion probabilities that correlate with graph centrality. We formalize preference learning with *network-sampled* annotators and show that identity-agnostic scalar reward modeling implicitly represents an inclusion-weighted welfare, over-representing structurally central communities whenever the inclusion distribution $q$ differs from a designer-chosen target weighting $\pi$. We propose Graph-Preference Learning, which combines (i) a graph-personalized reward model that shares statistical strength across neighboring annotators and (ii) graph-balanced aggregation that computes stabilized importance weights targeting $\pi$. Our analysis characterizes the welfare induced by the learned aggregate reward and bounds its deviation from the target in terms of weight mismatch, reward-model approximation error, and finite-sample effects. Experiments on synthetic graphs and a *semi-synthetic* case study on the LMArena preference dataset, where biased inclusion is *induced* via graph-based sampling, demonstrate up to a 62% reduction in target-welfare recovery error and a 17% reduction in cross-language performance gaps.
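To make the aggregation step concrete, the following is a minimal sketch of one standard way to stabilize the importance ratios $\pi_i/q_i$, namely truncation followed by self-normalization. The function names, the `clip` parameter, and the choice of truncation as the stabilization scheme are illustrative assumptions on our part; the abstract does not specify the exact form of the paper's graph-balanced aggregation.

```python
import numpy as np

def stabilized_weights(pi, q, clip=10.0):
    """Stabilized importance weights retargeting inclusion q to target pi.

    pi, q : arrays of per-annotator target and inclusion probabilities.
    clip  : truncation level for the raw ratios pi/q (hypothetical default);
            truncation plus self-normalization is one common stabilization,
            not necessarily the paper's exact scheme.
    """
    w = np.minimum(np.asarray(pi) / np.asarray(q), clip)  # truncated ratios
    return w / w.sum()                                    # self-normalize to sum to 1

def target_welfare_estimate(rewards, pi, q, clip=10.0):
    """Weighted aggregate of per-annotator rewards approximating the
    target welfare W_pi from a network-biased sample."""
    w = stabilized_weights(pi, q, clip)
    return float(np.dot(w, np.asarray(rewards)))
```

Self-normalizing after truncation trades a small amount of bias for variance reduction, which is the usual motivation when inclusion probabilities for structurally peripheral annotators are very small and raw ratios $\pi_i/q_i$ would otherwise dominate the estimate.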