When Labelers Stay Silent: The Power of Ties in Cost-Effective Preference Learning
Abstract
Standard preference alignment relies on a binary forced-choice paradigm, assuming every pair admits a definitive preference. However, we find that indistinguishable pairs are prevalent even in standard benchmarks: the quality difference between two responses often falls below the labeler's discriminative resolution limit. Forcing a choice in such cases can inject significant noise that undermines policy optimization. In this work, we propose a silent-aware framework: a principled way to let annotators stay silent (i.e., express ties) and to explicitly model these ties during optimization. Our findings reveal a compelling phenomenon: when ties are properly modeled, supervision from small models yields alignment that surpasses forced-choice supervision from LLMs or human experts. This discovery highlights a cost-effective path for alignment: respecting a labeler's resolution limit matters more than increasing its capability. It also unlocks the latent value of existing benchmarks, since their inherent tie signals can be modeled properly without any re-labeling effort. To leverage these signals, we design several optimization objectives that drive the policy toward high-reward regions while mitigating unreliable updates that cause arbitrary distribution shifts. Our approaches consistently outperform the strongest available baselines across diverse benchmarks.
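To make the idea of "explicitly modeling ties" concrete, the following is a minimal sketch of one standard tie-aware pairwise likelihood, the Rao-Kupper extension of the Bradley-Terry model. This is an illustrative assumption, not necessarily the objective used in the paper: a threshold parameter `theta > 1` reserves probability mass for ties when two reward scores are close, and `theta -> 1` recovers the forced-choice model.

```python
import math

def rao_kupper_probs(r_a, r_b, theta=1.5):
    """Tie-aware preference probabilities for reward scores r_a, r_b.

    Rao-Kupper model: theta > 1 widens the 'indistinguishable' band
    around r_a == r_b; theta -> 1 recovers standard Bradley-Terry
    forced choice (tie probability -> 0).
    """
    ea, eb = math.exp(r_a), math.exp(r_b)
    p_a = ea / (ea + theta * eb)   # P(a preferred)
    p_b = eb / (eb + theta * ea)   # P(b preferred)
    p_tie = 1.0 - p_a - p_b        # remaining mass assigned to a tie
    return p_a, p_b, p_tie

def tie_aware_nll(r_a, r_b, label, theta=1.5):
    """Negative log-likelihood for an annotation in {'a', 'b', 'tie'}.

    Summing this over a dataset gives a training loss for a reward
    model that can absorb tie labels instead of discarding them.
    """
    p_a, p_b, p_tie = rao_kupper_probs(r_a, r_b, theta)
    return -math.log({'a': p_a, 'b': p_b, 'tie': p_tie}[label])
```

Under this model, pairs whose reward gap falls below the resolution implied by `theta` contribute most of their likelihood through the tie term, so a "silent" label produces a gentle, symmetric update rather than a noisy forced gradient in an arbitrary direction.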