In-Training Defenses Against Emergent Misalignment in Language Models
David Kaczér ⋅ Magnus Jørgenvåg ⋅ Clemens Vetter ⋅ Esha Afzal ⋅ Robin Haselhorst ⋅ Lucie Flek ⋅ Florian Mai
Abstract
Fine‑tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain‑specific fine‑tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine‑tuning API, EMA can inadvertently give attackers access to a broadly misaligned model in a way that is hard to detect from the fine‑tuning data alone. We present the first systematic study of *in‑training* safeguards against EMA that are practical for providers who expose fine‑tuning via an API. We evaluate whether the safeguards (a) prevent broad misalignment, (b) allow narrow, in‑domain misalignment, (c) preserve learning on benign tasks, and (d) keep the model coherent. We investigate four training regularization interventions: (i) KL‑divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) preventative steering with an evil persona vector, and (iv) interleaving training examples from a general instruction‑tuning dataset. We demonstrate that selecting the interleaved data by the perplexity gap between aligned and misaligned models yields the best results overall.
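The KL‑regularization intervention (i) can be illustrated with a minimal sketch: the fine‑tuning loss is the usual cross‑entropy plus a weighted KL term pulling the fine‑tuned model's next‑token distribution toward a safe reference model's. The function names, shapes, the KL direction (reference ‖ fine‑tuned), and the weight `lam` are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last (vocab) axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_regularized_loss(ft_logits, ref_logits, target_ids, lam=0.1):
    """Cross-entropy on the targets plus lam * KL(p_ref || p_ft),
    averaged over sequence positions.

    ft_logits, ref_logits: (seq_len, vocab) logits from the fine-tuned
    and frozen safe reference model; target_ids: (seq_len,) token ids.
    The KL direction and lam are illustrative design choices.
    """
    p_ft = softmax(ft_logits)
    p_ref = softmax(ref_logits)
    eps = 1e-12  # guard against log(0)
    ce = -np.mean(np.log(p_ft[np.arange(len(target_ids)), target_ids] + eps))
    kl = np.mean(np.sum(p_ref * (np.log(p_ref + eps) - np.log(p_ft + eps)),
                        axis=-1))
    return ce + lam * kl
```

When the fine‑tuned model matches the reference exactly, the KL term vanishes and the loss reduces to plain cross‑entropy; as fine‑tuning drifts, the penalty grows with `lam` controlling the trade‑off between fitting the new domain and staying near the safe model.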