Skip to yearly menu bar Skip to main content


Oral

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark

Alexander Pan · Jun Shern Chan · Andy Zou · Nathaniel Li · Steven Basart · Thomas Woodside · Hanlin Zhang · Scott Emmons · Dan Hendrycks
2023 Oral
[ PDF

Abstract

Video

Chat is not available.