ICML Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols

Poster in Workshop: Trustworthy Multi-modal Foundation Models and AI Agents (TiFA)

Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols

Charlie Griffin · Buck Shlegeris · Alessandro Abate


Abstract:

To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise. First, we demonstrate the formalism's utility by presenting concrete results for a realistic example. Then, we explain our methodology: introducing AI-Control Games, modelling the example, and exploring solution methods.
