Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

Miles Turpin ⋅ Andy Arditi ⋅ Marvin Li ⋅ Joe Benton ⋅ Julian Michael

Abstract
