AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers

Heng Wang · Ruiqi Zhong · Jiaxin Wen · Jacob Steinhardt

Abstract
