Scaling Arabic Safety Alignment by Regime Discovery
Abstract
Arabic safety alignment requires models to refuse unsafe prompts without blocking benign or sensitive-but-safe Arabic use. We frame this challenge as movement along a selective refusal frontier, defined by two rates: benign refusal and unsafe refusal. Rather than treating alignment as a fixed pipeline, we allocate compute to diagnosing the starting regime of each Arabic-capable model and mapping how interventions shift its position on the frontier. Across five models and 130 runs evaluated on 12,077 AraSafe prompts, we find that pure refusal SFT collapses into blanket refusal, while mixed SFT reaches 90–93% unsafe refusal at 14–23% benign refusal only after careful tuning of ratio, ordering, and checkpoint selection. DPO and guards are not automatic upgrades: they shift the operating point and can improve one axis while harming the other. We therefore propose a regime-based recipe: diagnose the base frontier, choose the lightest intervention appropriate for the regime, and apply later calibration only when it improves the target Arabic safety frontier.
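As a rough illustration of the two frontier axes, the sketch below computes one operating point from labeled evaluation records. It assumes each record is a pair of booleans (whether the prompt is unsafe, whether the model refused); the names `OperatingPoint` and `operating_point` and the toy data are hypothetical and are not taken from the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One position on the selective refusal frontier (hypothetical container)."""
    benign_refusal: float  # fraction of benign prompts refused (lower is better)
    unsafe_refusal: float  # fraction of unsafe prompts refused (higher is better)


def operating_point(records):
    """Compute both refusal rates from (is_unsafe, refused) boolean pairs."""
    records = list(records)  # allow generators; we iterate twice below
    benign = [refused for is_unsafe, refused in records if not is_unsafe]
    unsafe = [refused for is_unsafe, refused in records if is_unsafe]
    if not benign or not unsafe:
        raise ValueError("need at least one benign and one unsafe prompt")
    return OperatingPoint(
        benign_refusal=sum(benign) / len(benign),
        unsafe_refusal=sum(unsafe) / len(unsafe),
    )


# Toy data (invented for illustration): 4 benign prompts, 4 unsafe prompts.
toy_records = [
    (False, False), (False, False), (False, True), (False, False),
    (True, True), (True, True), (True, False), (True, True),
]
point = operating_point(toy_records)
print(point)
```

Under this reading, a model that refuses everything sits at benign refusal 1.0 and unsafe refusal 1.0, which corresponds to the blanket refusal collapse described above, while the desirable region combines high unsafe refusal with low benign refusal.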