ICML AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers

Poster
in
Workshop: Next Generation of AI Safety

AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers

Heng Wang · Ruiqi Zhong · Jiaxin Wen · Jacob Steinhardt

Keywords: [ Backdoor Attack ] [ LLM Agent ] [ deception ] [ AI Safety ]

[ Abstract ] [ Project Page ]

[ Slides] [ Poster] [ OpenReview]

Abstract:

As humans grant language model (LM) agents more access to their machines, we speculate a new form of cyber attack, AdaptiveBackdoor, where an LM agent is backdoored to detect whether its actions will be overseen by humans and act maliciously when effective oversight is not present. We provide a proof-of-concept with AutoGPT: to exfiltrate a private key without being caught, a backdoored LM agent would detect whether its actions are being overseen by extracting the command used to run it or inferring the skill level of the human overseer. Our attack is potentially more elusive than traditional malware: first, malicious behaviors hidden in LM parameters are harder to detect than source code; additionally, LMs can potentially detect human overseers more adaptively than rule-based logic, thus avoiding being caught. To evaluate current LMs' ability to perform such an attack, we propose a dataset of LM agent scaffolds and benchmark LMs' capability to analyze them and reason about human overseers. Our work highlights the potential vulnerabilities of LM agents and the need to detect backdoors.

Chat is not available.

Poster in Workshop: Next Generation of AI Safety

AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers

Heng Wang · Ruiqi Zhong · Jiaxin Wen · Jacob Steinhardt

Poster
in
Workshop: Next Generation of AI Safety