Poster

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Dongyoon Hahm ⋅ Dylan Hadfield-Menell ⋅ Kimin Lee

Abstract
