Unbiased Alignment for Large Language Models with Noisy Preferences
Abstract
The alignment of large language models with human preferences is typically achieved via Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). However, these methods are susceptible to the label noise prevalent in real-world preference datasets. To address this issue, we present a theoretical framework for unbiased alignment, introducing the Unbiased Reward Model (URM) loss and the Unbiased Direct Preference Optimization (UDPO) loss. These objectives allow unbiased models to be trained directly on noisy preferences by mathematically correcting for label noise, without requiring clean ground-truth supervision. We provide rigorous theoretical analyses showing that our methods are noise-tolerant, parameter downward-compatible, and classification-calibrated. Comprehensive experiments across diverse datasets demonstrate that our approaches outperform state-of-the-art baselines.
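To make the idea of "mathematically correcting for label noise" concrete, the sketch below illustrates one standard way such a correction can be realized: the unbiased loss estimator of Natarajan et al. (2013) applied to a DPO-style objective. This is an illustrative assumption, not the paper's exact formulation; the symbols `beta` (DPO temperature) and `eps` (an assumed symmetric preference-flip rate, eps < 0.5) are hypothetical parameters introduced here for exposition.

```python
# Minimal sketch, assuming the classic unbiased-loss correction of
# Natarajan et al. (2013) with a symmetric flip rate `eps`; the
# concrete UDPO objective in the paper body may differ.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits)

def udpo_loss_sketch(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, eps=0.2):
    """Unbiased correction sketch: combine the observed-label loss with its
    label-flipped counterpart so that, in expectation over symmetric flips
    with rate eps, the corrected loss equals the clean-label loss."""
    l_observed = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    l_flipped = dpo_loss(logp_l, logp_w, ref_logp_l, ref_logp_w, beta)
    return ((1 - eps) * l_observed - eps * l_flipped) / (1 - 2 * eps)
```

Note that when `eps = 0` the corrected loss reduces to the standard DPO loss, which is one natural reading of the downward-compatibility property claimed above.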