GXPO: Group Cross-Lingual Relative Policy Optimization for Code Generation
Abstract
Current reinforcement learning (RL) methods for code generation are predominantly optimized on Python and generalize weakly to other programming languages (PLs). Although leveraging multilingual solutions offers richer semantics and a wider search landscape, naive independent training across languages suffers from optimization imbalance and fails to transfer knowledge effectively from high-resource languages. We propose Group Cross-Lingual Relative Policy Optimization (GXPO), which forms training groups by generating solutions to the same problem in multiple PLs and jointly optimizes language-specific and cross-language signals, enabling more balanced optimization and improved transfer to low-resource PLs. We additionally introduce Multilingual LiveCodeBench (ML-LCB), extending LiveCodeBench to a unified multilingual evaluation setting. On ML-LCB across 8 PLs, GXPO consistently improves performance, with pronounced gains on low-resource PLs, demonstrating scalable multilingual RL for language-consistent code generation.
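To make the grouping idea concrete, the following is a minimal sketch of how a group-relative advantage might be computed over solutions to one problem sampled in several PLs. It assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation) and a blending weight `alpha` between the per-language and cross-language signals; the function names and `alpha` are illustrative assumptions, not the paper's actual formulation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward by its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def cross_lingual_advantages(samples, alpha=0.5, eps=1e-6):
    """Blend per-language and cross-language group-relative advantages
    for one problem. `samples` maps language -> list of scalar rewards
    (e.g. unit-test pass rates). `alpha` weights the cross-language
    signal; both the blending scheme and `alpha` are assumptions here.
    """
    # Cross-language group: pool every sampled solution for this problem,
    # so a strong solution in one PL shifts the baseline for all PLs.
    pooled = [r for rs in samples.values() for r in rs]
    g_mean = statistics.fmean(pooled)
    g_std = statistics.pstdev(pooled)

    advantages = {}
    for lang, rs in samples.items():
        intra = group_relative_advantages(rs, eps)            # language-specific signal
        inter = [(r - g_mean) / (g_std + eps) for r in rs]    # cross-language signal
        advantages[lang] = [
            (1 - alpha) * a + alpha * b for a, b in zip(intra, inter)
        ]
    return advantages
```

Under this sketch, a low-resource PL whose samples all fail still receives a nonzero (negative) advantage from the cross-language term, whereas purely per-language normalization (`alpha=0`) would assign it zero gradient signal.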