

Poster

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark

Alexander Pan ⋅ Jun Shern Chan ⋅ Andy Zou ⋅ Nathaniel Li ⋅ Steven Basart ⋅ Thomas Woodside ⋅ Hanlin Zhang ⋅ Scott Emmons ⋅ Dan Hendrycks
2023 Poster
