Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

Miles Turpin · Andy Arditi · Marvin Li · Joe Benton · Julian Michael

Abstract
