Poster Wed, Jul 8, 2026 • 1:00 AM – 2:45 AM PDT HALL A #3206

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Dongyoon Hahm ⋅ Dylan Hadfield-Menell ⋅ Kimin Lee

Abstract
