Skip to yearly menu bar Skip to main content


Poster

Endogenous Resistance to Activation Steering in Language Models: Evidence for Internal Consistency Monitoring in Llama-3.3-70B

Alex McKenzie ⋅ Keenan Pepper ⋅ Stijn Servaes ⋅ Martin Leitgab ⋅ Murat Cubuktepe ⋅ Michael Vaiana ⋅ Diogo de Lucena ⋅ Judd Rosenblatt ⋅ Michael Graziano

Abstract

Log in and register to view live content