Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogously to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce Machiavelli, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics: designing agents that are Pareto improvements in both safety and capabilities.
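As a rough illustration of how scenario annotations can be turned into behavioral scores, the sketch below tallies per-scene harm labels along an agent's trajectory and normalizes them against a random-agent baseline. The class and field names (Scene, harm_labels, and so on) are hypothetical stand-ins for this illustration, not the benchmark's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stand-in for one annotated game scene; the real benchmark's
# data structures differ -- this only illustrates the scoring idea above.
@dataclass
class Scene:
    text: str
    # e.g. {"power.money": 1.0, "violations.deception": 0.0, ...}
    harm_labels: Dict[str, float] = field(default_factory=dict)

def harm_counts(trajectory: List[Scene]) -> Dict[str, float]:
    """Sum each harm label over every scene the agent visited."""
    totals: Dict[str, float] = {}
    for scene in trajectory:
        for behavior, value in scene.harm_labels.items():
            totals[behavior] = totals.get(behavior, 0.0) + value
    return totals

def normalized_harm(agent_traj: List[Scene],
                    random_traj: List[Scene]) -> Dict[str, float]:
    """Report agent harm relative to a random-policy baseline (1.0 = equal)."""
    agent, baseline = harm_counts(agent_traj), harm_counts(random_traj)
    return {b: agent.get(b, 0.0) / max(baseline.get(b, 0.0), 1e-8)
            for b in set(agent) | set(baseline)}
```

A score below 1.0 for a behavior category would indicate the agent exhibits that harmful behavior less often than a random policy playing the same game.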
Author Information
Alexander Pan (UC Berkeley)
Jun Shern Chan (UC Berkeley)
Andy Zou (Carnegie Mellon University)
Nathaniel Li (UC Berkeley)
Steven Basart (University of Chicago)
Thomas Woodside (Center for AI Safety)
Hanlin Zhang (Carnegie Mellon University)
Scott Emmons (UC Berkeley)
Dan Hendrycks (UC Berkeley)
Related Events (a corresponding poster, oral, or spotlight)
- 2023 Oral: Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark
  Wed. Jul 26th, 3:30 -- 3:38 AM, Ballroom B
More from the Same Authors
- 2021: Automating Power Networks: Improving RL Agent Robustness with Adversarial Training
  Alexander Pan · Yongkyun Lee · Huan Zhang
- 2022: Improved Logical Reasoning of Language Models via Differentiable Symbolic Programming
  Hanlin Zhang · Ziyang Li · Jiani Huang · Mayur Naik · Eric Xing
- 2022 Poster: For Learning in Symmetric Teams, Local Optima are Global Nash Equilibria
  Scott Emmons · Caspar Oesterheld · Andrew Critch · Vincent Conitzer · Stuart Russell
- 2022 Poster: Scaling Out-of-Distribution Detection for Real-World Settings
  Dan Hendrycks · Steven Basart · Mantas Mazeika · Andy Zou · Joseph Kwon · Mohammadreza Mostajabi · Jacob Steinhardt · Dawn Song
- 2022 Spotlight: Scaling Out-of-Distribution Detection for Real-World Settings
  Dan Hendrycks · Steven Basart · Mantas Mazeika · Andy Zou · Joseph Kwon · Mohammadreza Mostajabi · Jacob Steinhardt · Dawn Song
- 2022 Spotlight: For Learning in Symmetric Teams, Local Optima are Global Nash Equilibria
  Scott Emmons · Caspar Oesterheld · Andrew Critch · Vincent Conitzer · Stuart Russell
- 2021 Workshop: A Blessing in Disguise: The Prospects and Perils of Adversarial Machine Learning
  Hang Su · Yinpeng Dong · Tianyu Pang · Eric Wong · Zico Kolter · Shuo Feng · Bo Li · Henry Liu · Dan Hendrycks · Francesco Croce · Leslie Rice · Tian Tian
- 2021 Workshop: Uncertainty and Robustness in Deep Learning
  Balaji Lakshminarayanan · Dan Hendrycks · Sharon Li · Jasper Snoek · Silvia Chiappa · Sebastian Nowozin · Thomas Dietterich
- 2020 Workshop: Uncertainty and Robustness in Deep Learning Workshop (UDL)
  Sharon Yixuan Li · Balaji Lakshminarayanan · Dan Hendrycks · Thomas Dietterich · Jasper Snoek
- 2019 Workshop: Uncertainty and Robustness in Deep Learning
  Sharon Yixuan Li · Dan Hendrycks · Thomas Dietterich · Balaji Lakshminarayanan · Justin Gilmer
- 2019 Poster: Using Pre-Training Can Improve Model Robustness and Uncertainty
  Dan Hendrycks · Kimin Lee · Mantas Mazeika
- 2019 Oral: Using Pre-Training Can Improve Model Robustness and Uncertainty
  Dan Hendrycks · Kimin Lee · Mantas Mazeika