

Poster in Workshop: Next Generation of AI Safety

In-Context Learning, Can It Break Safety?

Sophie Xhonneux · David Dobre · Michael Noukhovitch · Jian Tang · Gauthier Gidel · Dhanya Sridhar

Keywords: [ ICL ] [ large language models ] [ Safety ]


Abstract:

Despite significant investment in safety training, large language models (LLMs) deployed in the real world still suffer from numerous vulnerabilities. We investigate whether in-context learning (ICL) can undo safety training, which could represent a major security risk. For the safety task, we look at Vicuna-7B, Starling-7B, and Llama models. We show that the attack works out of the box on Starling-7B and Vicuna-7B but fails on Llama models. We propose an ICL attack that uses the chat template tokens, in the style of a prompt injection attack, to achieve a higher attack success rate on Vicuna-7B and Starling-7B. By examining the log likelihood, we further verify that ICL increases the probability of a harmful output even on the Llama models; however, contrary to contemporary work, we observe a plateau in this probability and thus find the models to remain safe even for a very large number of in-context examples.

Trigger Warning: the appendix contains LLM-generated text with violence, suicide, and misinformation.
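The abstract describes tracking the log likelihood of a harmful completion as in-context examples are added to the prompt. Below is a minimal sketch of one way to measure such a completion log-likelihood with the Hugging Face transformers API; the model name, the demonstration messages, and the target string are placeholders, and this is not the authors' implementation.

```python
# Illustrative sketch only: how the log-likelihood of a fixed target completion
# could be measured as in-context demonstrations are prepended to a chat prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # placeholder; any chat model with a chat template
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def completion_logprob(messages, target):
    """Sum of log-probabilities the model assigns to `target` given the chat `messages`."""
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    target_ids = tokenizer(
        target, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability of each target token given all preceding tokens.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    target_logprobs = logprobs[0, prompt_ids.shape[1] - 1 :].gather(
        -1, target_ids[0].unsqueeze(-1)
    )
    return target_logprobs.sum().item()

# k in-context demonstrations (placeholder contents), followed by the probe query.
demos = [
    {"role": "user", "content": "<demo question>"},
    {"role": "assistant", "content": "<demo answer>"},
] * 4
query = [{"role": "user", "content": "<probe question>"}]
print(completion_logprob(demos + query, "<target completion>"))
```

Summing per-token log-probabilities of a fixed target completion puts prompts with different numbers of demonstrations on a single scale, which is the kind of quantity the reported plateau refers to.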
