Persuaded but Not Aligned: A Relapse Test for LLM Realignment under Adversarial Incentives
Abstract
Alignment evaluations typically measure model behavior in single-turn settings, leaving unclear whether apparent behavioral change persists once external pressure is removed. We investigate this question in a controlled multi-agent Among Us testbed where LLM agents initialized with deceptive goals are exposed to structured persuasion and later evaluated under neutral prompts without explicit policy guidance. We find that persuasive dialogue can induce rapid cooperative behavior: most agents verbally accept the moral argument (69%), but only 46% sustain cooperation once external pressure is removed (a 23-point compliance gap). The gap is strongly asymmetric: when verbal acceptance fails to translate into durable behavior, it does so almost exclusively in the direction of superficial compliance. A cross-model ablation shows that susceptibility varies across models, and a reverse-direction experiment reveals that verbal acceptance overestimates alignment-favorable durability but accurately tracks alignment-adverse shift. These results motivate relapse-based protocols as a complement to single-turn evaluations of agents operating under adversarial incentives.
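As a worked reading of the headline figures, the compliance gap can be written as the difference between the verbal acceptance rate and the rate of sustained cooperation. The symbols $G$, $A_{\text{verbal}}$, and $A_{\text{durable}}$ are illustrative names introduced here and are not notation from the paper; the gap is expressed in percentage points.

\[
G \;=\; A_{\text{verbal}} - A_{\text{durable}} \;=\; 69\% - 46\% \;=\; 23 \text{ percentage points}
\]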