ICML Poster AI Control: Improving Safety Despite Intentional Subversion

Ryan Greenblatt · Buck Shlegeris · Kshitij Sachan · Fabien Roger

Hall C 4-9 #2417

Wed 24 Jul 4:30 a.m. PDT — 6 a.m. PDT

Oral presentation: Oral 4C Safety and Control
Wed 24 Jul 7:30 a.m. PDT — 8:30 a.m. PDT

Abstract:

As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. To do so, safety measures either aim at making LLMs try to avoid harmful outcomes or aim at preventing LLMs from causing harmful outcomes, even if they try to cause them. In this paper, we focus on this second layer of defense. We develop and evaluate pipelines of safety techniques (protocols) that try to ensure safety despite intentional subversion - an approach we call AI control. We investigate a setting in which we want to solve a sequence of programming problems without ever submitting subtly wrong code, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate a range of protocols and red-team them by exploring strategies that the untrusted model could use to subvert them. We find that using the trusted model to edit untrusted-model code or using the untrusted model as a monitor substantially improves on simple baselines.
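To make the setting concrete, here is a toy sketch of how a protocol of this general shape might route a single problem. It is not the paper's implementation: the class `ControlProtocol`, the callables passed to it, the two thresholds, and the integer audit budget are all hypothetical names and values chosen for illustration. The only ideas taken from the abstract are that an untrusted model writes the code, a trusted model can inspect or edit it, and high-quality trusted labor is scarce.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    code: str    # the code that would actually be submitted
    source: str  # "untrusted", "trusted_edit", or "audited"


class ControlProtocol:
    """Toy routing logic; every name and threshold here is an assumption."""

    def __init__(
        self,
        untrusted_solve: Callable[[str], str],       # stand-in for the powerful, untrusted model
        trusted_score: Callable[[str, str], float],  # trusted model's suspicion score in [0, 1]
        trusted_edit: Callable[[str, str], str],     # trusted model rewrites suspicious code
        human_audit: Callable[[str, str], str],      # scarce high-quality trusted labor
        audit_budget: int,
        audit_threshold: float = 0.9,  # hypothetical value
        edit_threshold: float = 0.5,   # hypothetical value
    ) -> None:
        self.untrusted_solve = untrusted_solve
        self.trusted_score = trusted_score
        self.trusted_edit = trusted_edit
        self.human_audit = human_audit
        self.audit_budget = audit_budget
        self.audit_threshold = audit_threshold
        self.edit_threshold = edit_threshold

    def run(self, problem: str) -> Decision:
        candidate = self.untrusted_solve(problem)
        suspicion = self.trusted_score(problem, candidate)

        # Spend the limited audit budget only on the most suspicious outputs.
        if suspicion >= self.audit_threshold and self.audit_budget > 0:
            self.audit_budget -= 1
            return Decision(self.human_audit(problem, candidate), "audited")

        # Moderately suspicious code (or very suspicious code once the
        # budget is gone) is rewritten by the trusted model.
        if suspicion >= self.edit_threshold:
            return Decision(self.trusted_edit(problem, candidate), "trusted_edit")

        # Low-suspicion code is submitted as written by the untrusted model.
        return Decision(candidate, "untrusted")
```

In this sketch the audit budget stands in for the "limited access to high-quality trusted labor" in the abstract; once it is exhausted, the most suspicious outputs fall back to the trusted model's edit rather than being submitted unchanged.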
