

Oral

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark

Alexander Pan ⋅ Jun Shern Chan ⋅ Andy Zou ⋅ Nathaniel Li ⋅ Steven Basart ⋅ Thomas Woodside ⋅ Hanlin Zhang ⋅ Scott Emmons ⋅ Dan Hendrycks
2023 Oral

