Poster in Workshop: Next Generation of AI Safety

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Thomas Kwa · Drake Thomas · Adrià Garriga-Alonso
