What Aggregate Accuracy Hides: Cultural Affective Inequity in Multilingual LLMs
Abstract
Multilingual LLMs may exhibit similar aggregate accuracy while failing differently across languages and emotions. To analyze the cross-cultural disparities that aggregate evaluation hides, we introduce two complementary disparity metrics: the Cultural Inequity Score (CIS), which quantifies how concentrated performance disparity is across languages, and the Emotion-Stratified Gap (ESG), which measures emotion-specific cross-cultural performance gaps. Evaluating four multilingual LLMs on the CEDAR benchmark across six languages, we identify three cases where aggregate evaluation obscures distinct cross-cultural failure patterns: similarly low Swahili accuracy can arise from fundamentally different failure modes across models; a multilingual-oriented model exhibits lower CIS than a larger general-purpose model despite lower aggregate accuracy; and happiness shows the largest cross-cultural ESG across all four models despite its relatively high mean accuracy among the 14 emotion categories. These results suggest that aggregate accuracy alone may be insufficient for pluralistic affective evaluation, because important distributional patterns of cross-cultural failure can remain hidden without decomposition.