Reliability-Aware LLM Alignment from Inconsistent Human Feedback
Jingyi Huang ⋅ Ruohan Zong ⋅ Yujun Feng ⋅ Liran Ma ⋅ Lanyu Shang ⋅ Yang Zhang
Abstract
Reinforcement Learning from Human Feedback (RLHF) is critical for aligning Large Language Models (LLMs) with human preferences. However, its efficacy is often compromised by the inherent inconsistency and subjectivity of human annotations. Existing preference optimization frameworks, such as Direct Preference Optimization (DPO), typically treat ambiguous pairs with high annotator disagreement identically to those with unanimous consensus, forcing models to overfit to inconsistent supervision signals and leading to suboptimal alignment. In this work, we propose $\textit{Reliability-Guided Preference Optimization}$ (RGPO), a robust framework designed to mitigate the impact of inconsistent human feedback. RGPO estimates annotator reliability and infers latent ground truth labels from noisy human feedback to identify robust preferences. Furthermore, we introduce a reliability-aware consistency optimization that dynamically modulates the training objective based on the consensus level of annotations, ensuring the model prioritizes high-consensus supervision signals. Extensive experiments on LLM alignment benchmarks demonstrate that RGPO effectively reduces inconsistency and noise in training data and achieves superior performance compared to widely adopted RLHF baselines.
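To make the idea of consensus-modulated preference optimization concrete, the following is a minimal sketch of how a reliability-weighted, DPO-style objective could look. The per-pair `consensus` weight (assumed here to lie in [0, 1] and to come from the estimated annotator reliabilities) and the function name are illustrative assumptions, not the paper's actual RGPO formulation.

```python
# Hypothetical sketch: a reliability-weighted DPO-style loss.
# `consensus` is an assumed per-pair agreement score derived from
# estimated annotator reliability; RGPO's actual objective may differ.
import torch
import torch.nn.functional as F

def reliability_weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                  ref_chosen_logps, ref_rejected_logps,
                                  consensus, beta=0.1):
    """Down-weights preference pairs with low annotator consensus.

    All log-prob tensors have shape (batch,); `consensus` holds per-pair
    agreement scores in [0, 1] (1 = unanimous, 0 = fully ambiguous).
    """
    # Standard DPO implicit-reward margin between chosen and rejected responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards

    # Per-pair DPO loss, modulated by the consensus weight so that
    # high-agreement pairs dominate the gradient signal.
    per_pair_loss = -F.logsigmoid(margin)
    return (consensus * per_pair_loss).mean()
```

In this sketch, pairs with unanimous annotations contribute the full DPO gradient, while highly ambiguous pairs are softly discounted rather than discarded.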