MESA: Improving MoE Safety Alignment via Decentralized Expertise
Abstract
Mixture-of-Experts (MoE) architectures have emerged as a popular paradigm for scaling Large Language Models (LLMs), enabling greater capacity at reduced computational cost by dynamically routing each input to its most relevant experts. However, this routing introduces a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in a small subset of experts, leaving the model's safeguards susceptible to adversarial bypass. Meanwhile, conventional alignment methods apply uniform updates across all parameters, ignoring their functional differences and inadvertently degrading general utility. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibilities to maximize coverage while explicitly minimizing interference with general capabilities. Grounded in Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation, which uses a transport cost matrix to distribute safety duties to the most cost-effective experts; and (2) Dynamic Routing Refinement, which constrains the router to ensure precise activation of these decentralized modules. Extensive experiments demonstrate that MESA achieves robust defensive performance across diverse harmfulness benchmarks while preserving general helpfulness.
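To make the OT-based Expert Capacity Reallocation concrete, the following is a minimal illustrative sketch of how a transport plan could assign safety duties to experts under a given cost matrix, using standard entropic-regularized Sinkhorn iterations. This is not the paper's actual implementation: the cost construction (penalizing experts with heavy general-utility load) and all names here (sinkhorn_plan, utility_load, duty_mass, expert_budget) are hypothetical stand-ins for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, src_mass, tgt_mass, reg=0.1, n_iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    cost:     (n_duties, n_experts) transport cost matrix
    src_mass: (n_duties,)  mass of each safety duty (sums to 1)
    tgt_mass: (n_experts,) capacity budget of each expert (sums to 1)
    Returns a plan T where T[i, j] is the share of duty i assigned
    to expert j.
    """
    K = np.exp(-cost / reg)  # Gibbs kernel of the cost matrix
    u = np.ones_like(src_mass)
    v = np.ones_like(tgt_mass)
    for _ in range(n_iters):
        # Alternate scaling so the plan's marginals match both masses
        u = src_mass / (K @ v)
        v = tgt_mass / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy setup (hypothetical): 4 safety duties, 8 experts. Assigning a
# duty to an expert is assumed costlier when that expert carries more
# general-utility traffic, so safety duties flow toward "cheap"
# experts and interference with helpfulness stays low.
rng = np.random.default_rng(0)
utility_load = rng.random(8)                      # per-expert general-task load
cost = np.tile(utility_load, (4, 1)) + 0.1 * rng.random((4, 8))

duty_mass = np.full(4, 1 / 4)                     # uniform safety duties
expert_budget = np.full(8, 1 / 8)                 # uniform expert capacity

plan = sinkhorn_plan(cost, duty_mass, expert_budget)
print(plan.round(3))     # rows: duties, columns: experts
print(plan.sum(axis=1))  # each duty fully assigned (~0.25 per row)
```

Under this toy formulation, the resulting plan spreads each safety duty across several low-cost experts rather than concentrating it in one, which mirrors the decentralization goal the abstract describes; Dynamic Routing Refinement would then constrain the router to activate the experts carrying mass in this plan.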