

Poster Wed, Jul 8, 2026 • 5:00 PM – 6:45 PM KST HALL A

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Dongyoon Hahm ⋅ Dylan Hadfield-Menell ⋅ Kimin Lee

Abstract
