Poster

In-Training Defenses Against Emergent Misalignment in Language Models

David Kaczér ⋅ Magnus Jørgenvåg ⋅ Clemens Vetter ⋅ Esha Afzal ⋅ Robin Haselhorst ⋅ Lucie Flek ⋅ Florian Mai

Abstract
