Persona-Model Collapse in Emergent Misalignment
Davi Bastos Costa ⋅ Renato Vicente
Abstract
Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as \emph{emergent misalignment}. We propose that emergent misalignment involves \emph{persona-model collapse}: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility ($S$) and moral robustness ($R$), computed respectively from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters ($S$) and its consistency when simulating a given one ($R$). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned to output secure code. Across the four models, insecure fine-tuning produces a $55\%$ average spike in $S$, pushing all four insecure variants beyond the band observed across 13 frontier models benchmarked in prior work (GPT-4o reaches more than twice the band's upper end), which signals dysregulated differentiation. It also causes a $65\%$ average drop in $R$, equivalent to a $304\%$ surge in $1/R$. By contrast, the matched secure control keeps $S$ near its base value and produces only a partial loss in $R$, showing that these effects are specific to the fine-tuning that induces emergent misalignment. Complementing these metric shifts, the insecure variants' unconditioned responses converge toward saturation near the scale ceiling, departing markedly from the structured responses of the other variants and from those elicited when base models role-play toxic personas. Taken together, these metrics provide a sensitive diagnostic for emergent misalignment and serve as behavioral evidence that it involves persona-model collapse.
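To make the idea of across- and within-persona variability concrete, here is a minimal Python sketch. The abstract does not state the exact definitions of $S$ and $R$, so everything below is an assumption for illustration only: the function name, the `(personas, questions, repeats)` array layout, taking $S$ as the spread of persona-mean ratings, and taking $R$ as the inverse of the average within-persona spread. It is not the authors' implementation.

```python
import numpy as np


def susceptibility_and_robustness(ratings):
    """Illustrative (assumed) variability scores for questionnaire ratings.

    ratings: array-like of shape (personas, questions, repeats), holding the
    numeric rating given for each question on each repeated run of a persona.
    Returns (S, R) under the assumed definitions described in the comments.
    """
    ratings = np.asarray(ratings, dtype=float)

    # Within-persona variability: sample std over repeated runs,
    # averaged over all personas and questions.
    within = ratings.std(axis=2, ddof=1).mean()

    # Across-persona variability: sample std of the persona-mean ratings,
    # computed per question and then averaged over questions.
    persona_means = ratings.mean(axis=2)  # shape (personas, questions)
    across = persona_means.std(axis=0, ddof=1).mean()

    # ASSUMPTION: S is the across-persona spread, R the inverse within-persona spread.
    S = across
    R = 1.0 / within if within > 0 else float("inf")
    return S, R


# Two hypothetical personas, one question, two repeated runs each.
distinct_and_stable = [[[0, 0]], [[4, 4]]]   # personas differ, runs agree
identical_and_noisy = [[[1, 3]], [[1, 3]]]   # personas agree, runs differ

S_a, R_a = susceptibility_and_robustness(distinct_and_stable)
S_b, R_b = susceptibility_and_robustness(identical_and_noisy)
```

Under these assumed definitions, the first toy input yields a nonzero $S$ with unbounded $R$ (personas are differentiated and each one answers consistently), while the second yields $S = 0$ with a finite $R$ (no differentiation and noisy answers).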