Multilingual Safety Alignment via Representation-Space Separability
Abstract
Large language models (LLMs) have been globally adopted in a wide range of scenarios, making robust multilingual safety alignment a prerequisite for their reliable deployment across diverse languages. Despite recent advances, LLMs exhibit a substantial safety gap between high- and low-resource languages: models that consistently refuse harmful requests in high-resource languages often fail to do so in low-resource ones. In this work, we show that such safety failures stem from insufficient representation-space separability between harmful and harmless prompts in low-resource languages. Through geometric analyses, we find that harmful prompts in these languages are significantly less separated from the manifold of harmless prompts than their English counterparts, and that the resulting cross-lingual spatial margin gap is strongly correlated with attack success rates. Capitalizing on these insights, we propose Multilingual Spatial Margin Gap-based Optimization (SMO), a novel training strategy that exploits the well-aligned safety geometry of a dominant language (e.g., English) to enhance safety alignment in other languages. SMO explicitly leverages the spatial margin gap between English and each target language as an example-wise supervision signal, enabling effective cross-lingual transfer of safety capabilities while preserving the dominant language's original performance. Experiments on LLaMA-3.1-8B-Instruct and Qwen2.5-7B-Instruct demonstrate that SMO reduces attack success rates in low-resource languages to near zero, and often to zero, while maintaining strong general multilingual performance. Warning: This paper contains content that may be harmful.
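To make the abstract's central quantity concrete, the following is a minimal sketch of how an example-wise spatial margin gap could be computed and turned into supervision weights. This is an illustrative assumption, not the paper's actual formulation: `spatial_margin` here uses nearest-neighbor Euclidean distance to the harmless set as a proxy for separability, and `margin_gap_weights` is a hypothetical weighting scheme that upweights examples whose target-language margin lags English.

```python
import numpy as np

def spatial_margin(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """Distance from each harmful-prompt embedding to its nearest
    harmless-prompt embedding (a simple proxy for separability).

    harmful:  (n_harmful, d) embeddings of harmful prompts
    harmless: (n_harmless, d) embeddings of harmless prompts
    """
    # Pairwise Euclidean distances, shape (n_harmful, n_harmless).
    dists = np.linalg.norm(harmful[:, None, :] - harmless[None, :, :], axis=-1)
    return dists.min(axis=1)

def margin_gap_weights(margin_en: np.ndarray,
                       margin_tgt: np.ndarray,
                       eps: float = 1e-8) -> np.ndarray:
    """Hypothetical example-wise weights: larger where the target
    language's margin falls short of English, so those examples
    receive more alignment pressure during training."""
    gap = np.clip(margin_en - margin_tgt, 0.0, None)  # only penalize deficits
    return gap / (gap.sum() + eps)                    # normalize to sum ~1

# Toy usage with 2-D embeddings (illustrative values only).
harmless = np.array([[0.0, 0.0], [0.5, 0.0]])
harmful_en = np.array([[3.0, 0.0], [4.0, 0.0]])   # well separated in English
harmful_tgt = np.array([[1.0, 0.0], [4.0, 0.0]])  # first example under-separated

m_en = spatial_margin(harmful_en, harmless)
m_tgt = spatial_margin(harmful_tgt, harmless)
weights = margin_gap_weights(m_en, m_tgt)
# The under-separated first example gets the larger weight.
```

Under these assumptions, the weights would then scale a per-example safety (e.g., refusal) loss in the target language, concentrating optimization on prompts whose geometry deviates most from English.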