Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck
Abstract
Large language models (LLMs) have emerged as a standard paradigm for automated multilingual evaluation, yet they exhibit systematic biases. In this paper, we identify ``translationese bias'', in which LLMs systematically favor machine-translated text over human-authored references, a bias that is particularly pronounced in low-resource languages. We attribute this bias to spurious correlations with (a) strong latent manifold isomorphism with English and (b) high predictive confidence. To mitigate these issues, we present DIBJudge, a robust fine-tuning framework that decouples robust features from bias representations by explicitly isolating spurious attributes into a dedicated bias branch and penalizing mutual dependence to enforce disentanglement. In addition, we introduce a vector-quantized compression module that ensures the robust representation retains minimal and sufficient judgment-critical information. Extensive evaluations on multilingual reward modeling benchmarks and a specially designed translationese bias evaluation suite demonstrate that DIBJudge outperforms strong baselines and effectively mitigates translationese bias.
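The abstract's disentanglement idea, penalizing statistical dependence between a robust branch and a bias branch, can be illustrated with a minimal sketch. The exact dependence measure used by DIBJudge is not specified here, so this example substitutes a simple linear stand-in: the squared Frobenius norm of the batch cross-covariance between the two branch representations. All function and variable names below are hypothetical.

```python
import numpy as np

def cross_covariance_penalty(z_robust, z_bias):
    """Penalize linear dependence between two branch representations.

    Returns the squared Frobenius norm of the batch cross-covariance
    matrix; it is near zero when the branches are uncorrelated.
    """
    zr = z_robust - z_robust.mean(axis=0, keepdims=True)
    zb = z_bias - z_bias.mean(axis=0, keepdims=True)
    n = z_robust.shape[0]
    c = zr.T @ zb / (n - 1)          # cross-covariance matrix
    return float((c ** 2).sum())

# Toy batch of 256 examples with 8-dimensional representations.
rng = np.random.default_rng(0)
a = rng.normal(size=(256, 8))
b = rng.normal(size=(256, 8))        # drawn independently of a
penalty_independent = cross_covariance_penalty(a, b)
penalty_dependent = cross_covariance_penalty(a, a)  # identical branches
```

Minimizing such a penalty during fine-tuning pushes the robust branch to discard information carried by the bias branch; a mutual-information estimator would capture nonlinear dependence as well, at higher cost.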