Poster in Workshop: Next Generation of AI Safety

AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers

Heng Wang ⋅ Ruiqi Zhong ⋅ Jiaxin Wen ⋅ Jacob Steinhardt
