The Steerable Self: Activation-Based Interventions for Big Five Personality Control in LLMs
Abstract
Recent work in mechanistic interpretability and alignment has explored the steerability and localization of persona traits, but has usually focused on alignment-relevant traits such as 'evil' or on task generalization in other domains, rather than on varied personality expression. In this work, we systematically compare different interventions for steering Big Five personality expression in small instruction-tuned Llama models, across both high- and low-trait directions. Our results demonstrate that activation addition produces the most consistent bidirectional dose-response, raising mean judge scores by ~25\% at the optimal layer. Probe steering and directional ablation do not produce reliable behavioural shifts, despite sharing the same peak layers. Our results establish that difference-in-means targets a geometrically relevant direction in the residual stream: one that is both linearly organised at early-to-mid layers and writable in both orientations.