Poster

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Daniel Fein ⋅ Max Lamparth ⋅ Violet Xiang ⋅ Mykel Kochenderfer ⋅ Nick Haber

Abstract
