Position: Aggregate Preference Optimization Hides a Posterior Identifiability Failure for Pluralistic Alignment
Abstract
RLHF and related preference-optimization methods fit a single reward model to pairwise feedback pooled across diverse users. We argue that this aggregation hides a structural posterior identifiability failure: when subgroups hold conflicting preferences, the joint posterior over (subgroup composition, within-subgroup preference) is not identifiable from aggregate pairwise data alone, regardless of sample size. The aggregate-fit reward model can therefore achieve low training loss while making systematically incorrect predictions of within-subgroup preferences. We support the position with two empirical demonstrations on a Bradley-Terry preference model with K=2 subgroups holding opposing preferences over five attributes. First, under high-conflict subgroups, the aggregate-fit model collapses to a near-zero theta vector, with an aggregate MSE of 0.008 but a within-subgroup prediction MSE of 0.06-0.07 (roughly an 8x gap). Second, a sweep over subgroup diversity sigma_groups in [0, 3] shows the masking ratio (within-subgroup error / aggregate error) growing from 1.08 to 4.32. We propose a four-item disclosure standard for pluralistic alignment papers: subgroup composition disclosure, within-subgroup posterior reporting, identification gap quantification, and sensitivity of the results to the assumed mixture prior. The standard makes the difference between "aggregate-fit" and "pluralism-faithful" empirically auditable.
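To make the collapse concrete, here is a minimal Python sketch. It is not the experimental code behind the numbers above. It assumes two roughly equal-sized subgroups with exactly opposite Bradley-Terry weights (theta and -theta), a hypothetical weight vector, standard-normal attribute differences, and plain gradient descent. The function names and the two error definitions (MSE against the pooled choice probability and against each rater's own subgroup probability) are illustrative assumptions, so the values it prints are not expected to match the figures reported here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate(n_pairs=4000, seed=0):
    """Draw pairwise comparisons from K=2 subgroups with opposing preferences."""
    rng = np.random.default_rng(seed)
    # Hypothetical weights over five attributes; subgroup B holds the exact opposite.
    theta_a = np.array([1.5, -1.0, 0.5, 2.0, -1.5])
    thetas = np.stack([theta_a, -theta_a])
    dx = rng.normal(size=(n_pairs, theta_a.size))   # attribute differences x_i - x_j
    group = rng.integers(0, 2, size=n_pairs)        # latent subgroup, hidden from the fit
    p_own = sigmoid(np.sum(dx * thetas[group], axis=1))
    y = (rng.random(n_pairs) < p_own).astype(float) # 1 if item i was preferred
    return dx, y, group, thetas

def fit_aggregate(dx, y, lr=0.5, steps=500):
    """Fit one Bradley-Terry theta to the pooled labels by gradient descent."""
    theta = np.zeros(dx.shape[1])
    for _ in range(steps):
        grad = dx.T @ (sigmoid(dx @ theta) - y) / len(y)  # logistic-loss gradient
        theta -= lr * grad
    return theta

def masking_report(dx, group, thetas, theta_hat):
    """Return (aggregate MSE, within-subgroup MSE) under the assumed definitions."""
    p_hat = sigmoid(dx @ theta_hat)
    # Pooled probability under an equal mixture of the two subgroups.
    p_pooled = 0.5 * (sigmoid(dx @ thetas[0]) + sigmoid(dx @ thetas[1]))
    # Probability under the subgroup that actually produced each comparison.
    p_own = sigmoid(np.sum(dx * thetas[group], axis=1))
    agg_mse = float(np.mean((p_hat - p_pooled) ** 2))
    within_mse = float(np.mean((p_hat - p_own) ** 2))
    return agg_mse, within_mse

if __name__ == "__main__":
    dx, y, group, thetas = simulate()
    theta_hat = fit_aggregate(dx, y)
    agg_mse, within_mse = masking_report(dx, group, thetas, theta_hat)
    print("||theta_hat|| =", np.linalg.norm(theta_hat))
    print("aggregate MSE =", agg_mse, " within-subgroup MSE =", within_mse)
```

Because sigmoid(z) + sigmoid(-z) = 1, the pooled choice probability in this setup is exactly 0.5 for every pair. The pooled labels therefore carry no information about theta, and the aggregate fit stays near zero however large n_pairs is made, while each subgroup's own choice probabilities remain far from 0.5.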