

Poster

Feedback Loops With Language Models Drive In-Context Reward Hacking

Alexander Pan · Erik Jones · Meena Jagadeesan · Jacob Steinhardt
2024 Poster

Abstract
