Poster in Workshop: Pluralistic Alignment Workshop

Majority Vote Silences Minority Values: Annotator Disagreement at the Hate/Offensive Boundary in HateXplain

JOSHUA MUHUMUZA ⋅ Joab E Agaba ⋅ Mercy R Amiyo


Abstract

Hate speech annotation pipelines routinely collapse annotator disagreement into majority-vote labels before training. We show that this aggregation is not neutral: 42.6% of all annotator disagreement in HateXplain concentrates at the hate/offensive boundary, where annotators differ not because one is wrong but because they hold different values about where hate begins ($\chi^{2} = 135.199$, $df = 2$, $p < 0.0001$). Both a hard-label BERT model (Model A) and a soft-label model (Model B) lose 22 percentage points of accuracy between posts on which annotators agree ($\sim$80%) and posts on which they disagree ($\sim$58%), a drop significant at $p < 0.0001$. A per-annotator multi-head model (Model C) widens this gap to 28 points, and its accuracy on offensive disagreement posts collapses to 0.245. Critically, Model A is significantly more confident than Model C on boundary-case errors (0.710 vs. 0.495, $p < 0.0001$), meaning standard evaluation metrics will not detect the failure. Three downstream interventions of increasing sophistication all fail to recover boundary accuracy. We argue the problem is structural: majority vote presents a contested cultural value judgment as ground truth, and models inherit that false certainty. The intervention must therefore come upstream, in annotation design.
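
To make the aggregation step concrete, the following minimal sketch contrasts a majority-vote (hard) label with a soft label built from the same annotator votes. It is an illustration only, not the authors' pipeline: the label strings, the three-annotator example, and the helper names `majority_vote`, `soft_label`, and `is_hate_offensive_split` are all assumptions introduced here.

```python
from collections import Counter

# Assumed label names for illustration; the dataset's actual strings may differ.
LABELS = ("hatespeech", "offensive", "normal")


def majority_vote(votes):
    """Return the single most frequent label, or None if the top count is tied."""
    counts = Counter(votes)
    top_count = max(counts.values())
    winners = [label for label, c in counts.items() if c == top_count]
    return winners[0] if len(winners) == 1 else None


def soft_label(votes):
    """Return the empirical vote distribution over LABELS (sums to 1)."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in LABELS]


def is_hate_offensive_split(votes):
    """True when annotators disagree and only the hate and offensive labels appear."""
    return set(votes) == {"hatespeech", "offensive"}


# Hypothetical post labeled by three annotators.
votes = ["hatespeech", "offensive", "offensive"]

print(majority_vote(votes))            # offensive
print(soft_label(votes))               # one third hate, two thirds offensive, zero normal
print(is_hate_offensive_split(votes))  # True
```

In this example the hard label records only "offensive", so the one annotator who judged the post to be hate speech leaves no trace in the training target, while the soft label keeps that vote as probability mass.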