How Jailbreaking 1-Layer Transformers Taught Us How to Steer LLMs
Eric Wong
Abstract
Why are LLMs so fundamentally easy to jailbreak? We will first analyze how to subvert and "jailbreak" simple, one-layer transformers, formalizing rule-breaking as a vulnerability in the model's attention mechanism. This theoretical work taught us a critical lesson: if attention is the key to breaking rules, it may also be the key to enforcing them. We will then introduce InstABoost, a remarkably simple yet highly effective steering method that applies this insight directly, boosting the model's attention on user-provided instructions during generation. Inspired by our theoretical understanding of how a small model is jailbroken, this technique provides state-of-the-art control over large-scale LLMs in just five lines of code.
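As a rough illustration of the idea, the sketch below shows one way such an instruction-attention boost could be implemented: adding a bias to the attention logits at instruction-token positions before the softmax. This is a minimal sketch under assumptions, not the talk's actual implementation; the function name, mask format, and boost value are illustrative choices.

import torch
import torch.nn.functional as F

def boosted_attention(q, k, v, instruction_mask, boost=5.0):
    """Single-head scaled dot-product attention with extra weight on instruction tokens.

    q, k, v:           (seq_len, d) query/key/value matrices
    instruction_mask:  (seq_len,) float tensor, 1.0 at instruction positions, else 0.0
    boost:             additive bias applied to instruction-token attention logits (assumed mechanism)
    """
    d = q.shape[-1]
    logits = (q @ k.T) / d ** 0.5                 # (seq_len, seq_len) attention logits
    logits = logits + boost * instruction_mask    # up-weight columns for instruction tokens
    weights = F.softmax(logits, dim=-1)           # renormalize over keys
    return weights @ v

# Toy usage: 6 tokens, of which the first 3 form the instruction.
q = k = v = torch.randn(6, 8)
mask = torch.tensor([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
out = boosted_attention(q, k, v, mask)            # attention now favors the instruction tokens

In practice such a bias would be applied inside each attention layer of the LLM (e.g., via forward hooks) rather than in a standalone function, but the core operation is the few lines above.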
Video