Unbiased Alignment for Large Language Models with Noisy Preferences
Abstract
The alignment of large language models with human preferences is typically achieved via Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). However, these methods are susceptible to the label noise prevalent in real-world preference datasets. To address this issue, we present a theoretical framework for unbiased alignment, introducing the Unbiased Reward Model (URM) loss and the Unbiased Direct Preference Optimization (UDPO) loss. These objectives allow unbiased models to be trained directly on noisy preferences by mathematically correcting for label noise, without requiring clean ground-truth supervision. We provide rigorous theoretical analyses showing that our methods are noise-tolerant, parameter downward-compatible, and classification-calibrated. Comprehensive experiments across diverse datasets demonstrate that our approaches outperform state-of-the-art baselines.
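To make the idea of "mathematically correcting for label noise" concrete, the sketch below illustrates one standard way such a correction can be realized: the unbiased loss estimator of Natarajan et al. (2013) applied to a DPO-style objective. This is an illustrative assumption, not the paper's exact formulation; the symbols `beta` (DPO temperature) and `eps` (an assumed symmetric preference-flip rate, eps < 0.5) are hypothetical parameters introduced here for exposition.

```python
# Minimal sketch, assuming the classic unbiased-loss correction of
# Natarajan et al. (2013) with a symmetric flip rate `eps`; the
# concrete UDPO objective in the paper body may differ.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits)

def udpo_loss_sketch(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, eps=0.2):
    """Unbiased correction sketch: combine the observed-label loss with its
    label-flipped counterpart so that, in expectation over symmetric flips
    with rate eps, the corrected loss equals the clean-label loss."""
    l_observed = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    l_flipped = dpo_loss(logp_l, logp_w, ref_logp_l, ref_logp_w, beta)
    return ((1 - eps) * l_observed - eps * l_flipped) / (1 - 2 * eps)
```

Note that when `eps = 0` the corrected loss reduces to the standard DPO loss, which is one natural reading of the downward-compatibility property claimed above.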