Poster in Workshop: Trustworthy Multi-modal Foundation Models and AI Agents (TiFA)
Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols
Charlie Griffin · Buck Shlegeris · Alessandro Abate
Abstract:
To evaluate the safety and usefulness of deployment protocols for untrusted AIs, \emph{AI Control} uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces \emph{AI-Control Games}, a formal decision-making model of the red-teaming exercise. First, we demonstrate the formalism's utility by presenting concrete results for a realistic example. Then, we explain our methodology: introducing AI-Control Games, modelling the example, and exploring solution methods.