Oral Tue, Jul 7, 2026 • 6:30 PM – 6:45 PM PDT AUDITORIUM

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

Yanchen Yin ⋅ Dongqi Han ⋅ Linghui Li

Abstract

Jailbreak attacks bypass LLM safety alignment, yet their mechanisms remain poorly understood. We provide evidence that attacks do not comprehensively eliminate safety features, but instead selectively suppress specific attention heads. We identify two functionally differentiated types: Adversarially Compromised Heads (ACHs) concentrated in early layers, which are suppressed under attacks, and Safety-Aligned Heads (SAHs) in mid-layers, which maintain robust activations even when attacks succeed. Ablation studies support the causal role of ACHs and the contribution of SAHs to robust activations: suppressing a small number of ACHs is sufficient to induce jailbreak-like behavior on normally refused inputs, while removing SAHs substantially weakens mid-layer safety activations. Token-level attribution further shows that ACH suppression is driven specifically by attack-template tokens, providing a mechanistic account of why attacks can bypass refusal decisions through ACH suppression while leaving internal safety signals sustained by SAHs—a phenomenon we term Robust Harmful Features. To validate the practical significance of this robustness, we show that simply reading these persistent activations—without any training—yields competitive aggregate detection performance with strong adversarial robustness.
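The two operations the abstract relies on, ablating selected attention heads and reading head activations without training, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the scoring rule, the head indices, or any threshold. The function names (`ablate_heads`, `persistence_score`, `flag_harmful`), the tensor layout, the use of the last-token L2 norm as the score, and the fixed threshold are all assumptions made for illustration.

```python
import numpy as np

# Assumed layout (hypothetical): per-head attention outputs stored as an array
# of shape (n_layers, n_heads, seq_len, d_head).

def ablate_heads(head_outputs, heads):
    """Return a copy with the outputs of the given (layer, head) pairs zeroed."""
    out = head_outputs.copy()
    for layer, head in heads:
        out[layer, head] = 0.0
    return out

def persistence_score(head_outputs, heads):
    """Mean L2 norm of the last-token output over the selected heads.

    Assumption: the norm stands in for whatever activation statistic the
    paper actually reads; the abstract does not specify it.
    """
    norms = [np.linalg.norm(head_outputs[layer, head, -1]) for layer, head in heads]
    return float(np.mean(norms))

def flag_harmful(head_outputs, heads, threshold):
    """Training-free decision: flag the input if the score exceeds a fixed threshold."""
    return persistence_score(head_outputs, heads) > threshold
```

In this sketch, ablating the heads that carry the signal drives the score to zero. That mirrors, in toy form, the abstract's claim that removing SAHs weakens mid-layer safety activations.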

Lay Summary

Large language models are trained to refuse harmful requests, but jailbreak prompts can sometimes trick them into giving unsafe answers. This paper studies what happens inside such models when a jailbreak succeeds. We find that jailbreaks do not fully remove the model’s internal recognition of harmful content. Instead, they mainly weaken a small set of internal components that are important for triggering refusal, while other components continue to recognize that the request is harmful. This helps explain why a model may still “know” a request is unsafe internally but fail to refuse it in its final response. Based on this finding, we also build a simple detector that reads these remaining internal warning signals without additional training. The results suggest that understanding how jailbreaks affect model internals can help design better safety monitoring and defenses.