Poster in Workshop: Models of Human Feedback for AI Alignment
Scalably Solving Assistance Games
Cassidy Laidlaw · Eli Bronstein · Timothy Guo · Dylan Feng · Lukas Berglund · Justin Svegliato · Stuart Russell · Anca Dragan
Abstract:
Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, like incentives for deceptive behavior, by explicitly modeling the interaction between assistant and user as a two-player game where the assistant cannot observe the user's goal. Despite their potential, assistance games have only been explored in simple settings. Scaling them to more complex environments is difficult because it requires both accurately modeling human users' behavior and determining optimal actions in uncertain sequential decision-making problems. We tackle these challenges by introducing a deep reinforcement learning (RL) algorithm called AssistanceZero for solving assistance games, and applying it to a Minecraft-based assistance game with over $10^{400}$ possible goals. We show that AssistanceZero effectively aids simulated humans in achieving unseen goals and outperforms assistants trained with imitation learning and model-free RL. Our results suggest that assistance games are more tractable than previously thought, and that they are an effective framework for assistance at scale.
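To make the framing concrete, below is a minimal toy sketch of an assistance game in Python: the user's goal is hidden from the assistant, the assistant only observes the user's actions, and reward depends on the hidden goal. All names and dynamics (`ToyAssistanceGame`, the item list, the noisy user model) are illustrative assumptions for exposition, not the paper's Minecraft environment or the AssistanceZero algorithm.

```python
import random
from dataclasses import dataclass, field

# Toy assistance game: the user's goal is a hidden target item the assistant
# must infer from the user's behavior alone. Purely illustrative assumptions.

ITEMS = ["wood", "stone", "glass"]

@dataclass
class ToyAssistanceGame:
    horizon: int = 5
    goal: str = field(default_factory=lambda: random.choice(ITEMS))
    t: int = 0

    def user_action(self) -> str:
        # Simple, noisily goal-revealing user model: the user usually asks
        # for the item they want, but sometimes says nothing.
        return self.goal if random.random() < 0.8 else "no-op"

    def step(self, assistant_action: str, user_action: str):
        # The assistant is rewarded for fetching the hidden goal item and
        # penalized for fetching anything else; the goal itself stays hidden.
        self.t += 1
        if assistant_action == "wait":
            reward = 0.0
        elif assistant_action == self.goal:
            reward = 1.0
        else:
            reward = -1.0
        done = self.t >= self.horizon
        # The assistant only ever observes the user's action, never the goal.
        observation = user_action
        return observation, reward, done

def naive_assistant_policy(observed_requests):
    # Fetch the most frequently requested item once any request has been
    # seen; otherwise wait to gather more evidence about the hidden goal.
    requests = [a for a in observed_requests if a in ITEMS]
    if not requests:
        return "wait"
    return max(set(requests), key=requests.count)

if __name__ == "__main__":
    game = ToyAssistanceGame()
    history, total, done = [], 0.0, False
    while not done:
        u = game.user_action()
        a = naive_assistant_policy(history)
        obs, r, done = game.step(a, u)
        history.append(obs)
        total += r
    print(f"hidden goal={game.goal!r}, return={total}")
```

The key design point this sketch illustrates is that the assistant's uncertainty over the goal makes information-gathering (waiting, observing the user) valuable before committing to an action; scaling this to rich goal spaces is the challenge the paper addresses.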