Moral Orientation and Calibration: Coupled in Human Annotators, Separable in Judge LLMs
Abstract
Judge LLMs are increasingly used as scalable evaluators in ranking, reward modeling, and benchmark evaluation, but their scalar scores do not reveal the judgment structure underlying an assessment. In pluralistic moral evaluation, a gap between humans and a Judge LLM may arise either because the two attend to different moral dimensions or because they weight the same dimensions with miscalibrated strength. We formalize this contrast by decomposing judgments into moral orientation and moral calibration in a shared moral vector space. Moral Orientation Fit (MOF) measures directional similarity between category-level human demand vectors and Judge response vectors, whereas Vector RMSE measures axis-level differences in magnitude. Using a Measuring Hate Speech panel with 40 Judge LLMs, 50 target categories, and 522,292 sentence-level observations, we show that high orientation fit combined with low calibration error is associated with the smallest alignment gaps. We further find that orientation and calibration are tightly coupled among human annotators but more separable among Judge LLMs. This framework turns scalar agreement into a structured diagnosis of alignment failure: it distinguishes differences in moral evidence from errors in response strength, and it enables axis-resolved auditing and context-aware model selection.
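To make the orientation/calibration distinction concrete, the following minimal sketch compares one hypothetical human demand vector with two hypothetical Judge response vectors. It assumes that directional similarity is measured by cosine similarity and that Vector RMSE is the root-mean-square of per-axis differences; the abstract does not state the exact formulas, so both definitions, the function names, and the toy three-axis vectors are illustrative assumptions rather than the paper's implementation.

```python
import math


def orientation_fit(human, judge):
    """Cosine similarity between two vectors.

    ASSUMPTION: used here as a stand-in for Moral Orientation Fit (MOF);
    the abstract only says MOF measures directional similarity.
    """
    if len(human) != len(judge):
        raise ValueError("vectors must have the same number of axes")
    dot = sum(h * j for h, j in zip(human, judge))
    norm_h = math.sqrt(sum(h * h for h in human))
    norm_j = math.sqrt(sum(j * j for j in judge))
    if norm_h == 0 or norm_j == 0:
        raise ValueError("direction is undefined for a zero vector")
    return dot / (norm_h * norm_j)


def vector_rmse(human, judge):
    """Root-mean-square of the per-axis differences.

    ASSUMPTION: used here as a stand-in for Vector RMSE (calibration error).
    """
    if len(human) != len(judge):
        raise ValueError("vectors must have the same number of axes")
    squared = [(h - j) ** 2 for h, j in zip(human, judge)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical three-axis vectors for a single target category.
human_demand = [1.0, 2.0, 2.0]
judge_overscaled = [2.0, 4.0, 4.0]   # same direction, doubled magnitude
judge_misdirected = [2.0, 1.0, 2.0]  # similar magnitude, different direction

for name, judge in [("overscaled", judge_overscaled),
                    ("misdirected", judge_misdirected)]:
    fit = orientation_fit(human_demand, judge)
    err = vector_rmse(human_demand, judge)
    print(f"{name}: orientation fit = {fit:.3f}, RMSE = {err:.3f}")
```

In this toy case the overscaled Judge has perfect orientation but the larger magnitude error, while the misdirected Judge has the smaller magnitude error but a worse orientation. That is the kind of separability the two measures are meant to expose, and a single scalar agreement score would not distinguish the two failure modes.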