Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation
Abstract
Jailbreak prompts can trigger harmful completions on aligned LLMs. In response, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited training set, whereas real jailbreaks evolve and often fall out of distribution relative to that set, leading to failures on unseen attacks. In this paper, we tackle this failure on unseen jailbreaks, building on unsupervised latent direction discovery. We propose a bi-level adversarial training framework for zero-shot jailbreak defense. In the inner step, we simulate diverse jailbroken activations by extrapolating from refusal-state harmful-request activations via unsupervised latent direction discovery, which expands coverage of real jailbreak activation subspaces. In the outer step, we train a potential-induced steering field that pushes these adversarial jailbroken states into refusal regions while leaving benign activations unchanged. Across three LLMs and six classical jailbreak families, our method achieves strong defense with attack success rates mostly below 5%, and the rising subspace coverage throughout training helps explain the improved generalization.
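To make the bi-level loop concrete, the sketch below gives one plausible PyTorch reading of the abstract's pipeline. It is a minimal sketch under stated assumptions, not the paper's exact formulation: the PCA-based direction discovery, the quadratic losses, the unit gradient step, and names such as `PotentialField`, `inner_step`, and `refusal_center` are all illustrative choices introduced here.

```python
# Hypothetical sketch of the bi-level adversarial training loop described
# in the abstract. Direction discovery, the potential parameterization, and
# the losses are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class PotentialField(nn.Module):
    """Scalar potential phi(h); the steering update is a gradient step
    on phi, so steering is 'potential-induced' (assumed parameterization)."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def steer(self, h):
        h = h.requires_grad_(True)
        phi = self.net(h).sum()
        # create_graph=True so the outer loss can backprop into self.net.
        grad, = torch.autograd.grad(phi, h, create_graph=True)
        return h - grad  # one gradient step as the steering intervention


def latent_directions(acts, k=8):
    """Unsupervised direction discovery: here, top-k principal components
    of refusal-state harmful-request activations (an assumed instantiation)."""
    _, _, v = torch.pca_lowrank(acts, q=k)
    return v.T  # (k, dim)


def inner_step(h_harm, directions, alpha_max=8.0):
    """Simulate jailbroken activations by extrapolating refusal-state
    harmful-request activations along discovered directions with random
    signs and magnitudes, expanding subspace coverage."""
    idx = torch.randint(len(directions), (len(h_harm),))
    alpha = (torch.rand(len(h_harm), 1) * 2 - 1) * alpha_max
    return h_harm + alpha * directions[idx]


def outer_step(field, opt, h_adv, h_benign, refusal_center, lam=1.0):
    """Train the steering field: push simulated jailbroken states toward
    the refusal region (here, a mean refusal activation) while penalizing
    any displacement of benign activations."""
    loss_refuse = ((field.steer(h_adv.clone()) - refusal_center) ** 2).mean()
    loss_benign = ((field.steer(h_benign.clone()) - h_benign) ** 2).mean()
    loss = loss_refuse + lam * loss_benign
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Alternating `inner_step` (resampling adversarial states from freshly discovered directions) with `outer_step` mirrors the adversarial structure the abstract describes; the benign term is what keeps the learned field near the identity on benign inputs.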