Learning from Rating Distributions: Modeling Human Uncertainty for Better Alignment and Calibration
Abstract
Human-related tasks are inherently grounded in human interpretation rather than in an objective ground truth. Accordingly, many datasets collect Likert-scale ratings to capture graded human judgments. Nevertheless, standard training pipelines typically collapse these annotations, either by averaging ratings across annotators or by discretizing the rating scale into a small set of classes or Yes/No decisions. We argue that this collapse induces misalignment between model predictions and the distribution of human judgments, leading to systematic miscalibration in subjective tasks. While prior work has focused on architectural or optimization-related causes of miscalibration, we show that the supervision target itself is a key and underexplored source of it. We propose a simple yet effective alternative: learning directly from the empirical distribution of Likert-scale ratings, which preserves both ordinal structure and inter-annotator disagreement. Across multiple subjective prediction tasks, this approach improves alignment with human judgment distributions and yields better-calibrated models, while maintaining competitive or even improved predictive performance relative to aggregation-based training. Notably, preserving annotation structure during training substantially reduces the need for post-hoc calibration.
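To make the contrast between collapsed and distributional supervision concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a 5-point Likert scale with integer ratings 1 to 5, and all function names are hypothetical. The abstract does not specify the training loss, so two illustrative choices are shown: a soft-label cross-entropy, which retains annotator disagreement but ignores the ordering of the scale, and a distance between cumulative distributions, which also reflects the ordering.

```python
import numpy as np

def empirical_distribution(ratings, num_points=5):
    """Turn one item's Likert ratings (integers 1..num_points) into a probability vector."""
    counts = np.bincount(np.asarray(ratings) - 1, minlength=num_points)
    return counts / counts.sum()

def soft_label_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits).
    Illustrative stand-in: keeps disagreement, ignores ordinal distance."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())

def cdf_distance(p, q):
    """Sum of absolute differences between cumulative distributions.
    With unit spacing between scale points this is the 1-Wasserstein
    distance, so mass placed further away on the scale costs more."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

# Five hypothetical annotators rate the same item.
ratings = [2, 3, 3, 4, 5]

mean_target = np.mean(ratings)                  # 3.4: a single collapsed scalar
dist_target = empirical_distribution(ratings)   # [0.0, 0.2, 0.4, 0.2, 0.2]

# A model that outputs uniform logits over the five scale points.
uniform_logits = np.zeros(5)
loss = soft_label_cross_entropy(uniform_logits, dist_target)
```

Under these assumptions, the averaged target 3.4 carries no information about how many annotators chose 2 versus 5, whereas `dist_target` keeps that spread available as supervision.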