Poster in Workshop: Pluralistic Alignment Workshop

Playing Devil’s Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Ishaan Kelkar ⋅ Nebras Alam ⋅ Vikram Kakaria ⋅ Madhur Panwar ⋅ Vasu Sharma ⋅ Maheep Chaudhary


Abstract

We study how different personas affect sycophancy: a model's tendency to agree with a user even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately and of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror-image increase in sycophancy. Geometrically, the persona vector is largely independent of the sycophancy direction in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property than as a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/
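To make the two geometric ideas in the abstract concrete, the following is a minimal sketch, not the authors' released code. It assumes that a CAA-style direction can be approximated as the mean difference between activations of paired sycophantic and honest responses, and that "independence" between a persona vector and that direction can be probed with cosine similarity. The function names (`mean_difference`, `cosine`, `steer`), the steering strength `alpha`, and the three-dimensional toy activations are all hypothetical and chosen only for illustration.

```python
import math


def mean_difference(positive, negative):
    """Mean of (positive - negative) over paired activation vectors.

    Assumption: this stands in for a CAA-style steering direction.
    """
    n = len(positive)
    dim = len(positive[0])
    return [
        sum(p[i] - q[i] for p, q in zip(positive, negative)) / n
        for i in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state (negative alpha steers away)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations, made up for illustration (not data from the study).
sycophantic = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
honest = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]

# Hypothetical sycophancy direction from the labelled pairs.
syc_direction = mean_difference(sycophantic, honest)

# Hypothetical off-the-shelf persona vector, here orthogonal by construction.
persona_vector = [0.0, 1.0, 0.0]

similarity = cosine(persona_vector, syc_direction)
steered = steer([1.0, 1.0, 1.0], syc_direction, alpha=-2.0)
```

In this toy setup the cosine similarity is zero because the persona vector was constructed to be orthogonal to the sycophancy direction; in a real model both vectors would be extracted from a chosen layer's residual stream, and the similarity would be an empirical quantity.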