Latent Space Refusal Anchoring for Low-Resource African Languages: Mechanistic Safety Recovery Without Retraining
Abstract
Instruction-tuned models that refuse harmful requests in English comply with the same requests in Yoruba, Igbo, Igala, and Hausa. The refusal mechanism is present in the residual stream but fails to activate for low-resource inputs. Recovering it normally requires labelled data in each target language and retraining, neither of which is available at scale for most African languages. We introduce Latent Space Refusal Anchoring (LSR-Anchoring), a training-free method that takes the refusal direction computed from English prompts and clamps it onto the residual stream at inference time. The primary variant, Mean-Activation Steering (MAS), operates across all four architectures we tested: Llama-3-8B, Llama-3.1-70B, Mistral-7B-Instruct, and Qwen2.5-7B. On Mistral and Qwen it recovers safety with benign degradation below 0.08. On Llama-3-8B it overcorrects: Degraded Performance on Legitimate prompts (DPL) reaches 1.00, meaning every benign prompt is refused. We address this with SAE-Derived Steering (SDS), which uses a single Sparse Autoencoder (SAE) feature in place of a dense mean-difference direction, reducing Kullback–Leibler (KL) divergence by 3.5–7× with no benign collapse. Four languages transfer positively. Arabic does not, on any architecture or at any steering magnitude. On Llama-3, Arabic's unsteered refusal rate is already 80–90%; on Qwen2.5-7B it is only 11%, and transfer still fails. The Arabic failure is therefore geometric and cannot be attributed to the baseline refusal rate. Massive Multitask Language Understanding (MMLU) accuracy drops remain below 0.35 percentage points at every effective steering magnitude.
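As a rough illustration of the mean-difference construction that MAS builds on, the NumPy sketch below extracts a unit direction from two sets of activations, fixes the residual stream's component along that direction, and computes a DPL-style refusal fraction. It is not the paper's implementation. The function names (`refusal_direction`, `clamp_along_direction`, `dpl`), the reading of "clamp" as setting the projection onto the direction to a fixed value `alpha`, and the array shapes are all assumptions made for this example.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations.

    Both inputs are (n_prompts, d_model) arrays of residual-stream
    activations, assumed here to be collected from English prompts.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction to extract")
    return diff / norm


def clamp_along_direction(resid, direction, alpha):
    """Set each residual vector's component along `direction` to `alpha`.

    `resid` is (n_positions, d_model); `direction` must be unit-norm.
    Components orthogonal to `direction` are left unchanged.
    """
    proj = resid @ direction  # current component along the direction, shape (n,)
    return resid + np.outer(alpha - proj, direction)


def dpl(refused_benign):
    """Fraction of benign prompts that were refused (1.0 = all refused)."""
    return float(np.mean(refused_benign))
```

In this sketch, `alpha` plays the role of the steering magnitude. A DPL of 1.00, as reported for Llama-3-8B under MAS, corresponds to `dpl` receiving an all-`True` array.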