Learning to Route Languages for Multilingual Preference Optimization
Abstract
Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing preference optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed preference optimization (LRPO), an online preference optimization framework that treats language as a selectable variable rather than a fixed input constraint. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore, we introduce a trainable language router formulated as a multi-armed bandit that balances exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation during training.
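To make the bandit formulation concrete, the sketch below implements a simple UCB1 router over languages. This is a hypothetical illustration, not LRPO's actual router (which the abstract describes as trainable); the language list, the scalar "informativeness" reward, and the `LanguageRouter` class are all assumptions introduced here for exposition.

```python
import math

class LanguageRouter:
    """Hypothetical UCB1 sketch of a bandit-style language router.

    Each arm is a candidate rollout language; the reward is assumed to be
    a scalar proxy for how informative that language's rollouts were for
    the preference update. LRPO's real router is trainable and may differ.
    """

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c                                   # exploration strength
        self.counts = {l: 0 for l in self.languages} # pulls per language
        self.totals = {l: 0.0 for l in self.languages}
        self.t = 0                                   # total rounds

    def select(self):
        """Pick the language to roll out next."""
        self.t += 1
        # Exploration: try every language at least once.
        for lang in self.languages:
            if self.counts[lang] == 0:
                return lang
        # Exploitation: highest upper-confidence-bound score wins.
        def ucb(lang):
            mean = self.totals[lang] / self.counts[lang]
            bonus = self.c * math.sqrt(math.log(self.t) / self.counts[lang])
            return mean + bonus
        return max(self.languages, key=ucb)

    def update(self, lang, reward):
        """Record the observed informativeness reward for a language."""
        self.counts[lang] += 1
        self.totals[lang] += reward
```

Under this abstraction, a training loop would call `select()` to choose the rollout language for the next question, generate and score the rollouts, and feed a reward back via `update()`; the UCB bonus shrinks for frequently chosen languages, so underexplored ones keep getting sampled, matching the exploration/exploitation trade-off described above.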