Learning to Route Languages for Multilingual Preference Optimization
Abstract
Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing preference optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed preference optimization (LRPO), an online preference optimization framework that treats language as a selectable variable rather than a fixed input constraint. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore, we introduce a trainable language router formulated as a multi-armed bandit that balances exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation during training.
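To make the bandit formulation concrete, the sketch below implements a simple UCB1 router over languages. This is a hypothetical illustration, not LRPO's actual router (which the abstract describes as trainable); the language list, the scalar "informativeness" reward, and the `LanguageRouter` class are all assumptions introduced here for exposition.

```python
import math

class LanguageRouter:
    """Hypothetical UCB1 sketch of a bandit-style language router.

    Each arm is a candidate rollout language; the reward is assumed to be
    a scalar proxy for how informative that language's rollouts were for
    the preference update. LRPO's real router is trainable and may differ.
    """

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c                                   # exploration strength
        self.counts = {l: 0 for l in self.languages} # pulls per language
        self.totals = {l: 0.0 for l in self.languages}
        self.t = 0                                   # total rounds

    def select(self):
        """Pick the language to roll out next."""
        self.t += 1
        # Exploration: try every language at least once.
        for lang in self.languages:
            if self.counts[lang] == 0:
                return lang
        # Exploitation: highest upper-confidence-bound score wins.
        def ucb(lang):
            mean = self.totals[lang] / self.counts[lang]
            bonus = self.c * math.sqrt(math.log(self.t) / self.counts[lang])
            return mean + bonus
        return max(self.languages, key=ucb)

    def update(self, lang, reward):
        """Record the observed informativeness reward for a language."""
        self.counts[lang] += 1
        self.totals[lang] += reward
```

Under this abstraction, a training loop would call `select()` to choose the rollout language for the next question, generate and score the rollouts, and feed a reward back via `update()`; the UCB bonus shrinks for frequently chosen languages, so underexplored ones keep getting sampled, matching the exploration/exploitation trade-off described above.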