Tracing the Persona Circuit: How Large Language Models Encode and Express Character Traits
Abstract
Large Language Models (LLMs) demonstrate remarkable potential in role-playing tasks but frequently suffer from personality decay, termed "Out-of-Character" (OOC) behavior, during prolonged interactions. While heuristic strategies exist to align model behaviors, the internal computational dynamics driving personality expression remain opaque. A fundamental barrier to decoding these mechanisms is a metric gap: standard causal attribution paradigms target atomic, single-token outcomes, whereas personality manifests as a holistic, multi-token behavioral tendency. We bridge this gap via the Latent Persona Vector, a differentiable proxy that enables the first fine-grained causal tracing of personality circuits. This metric reveals a structured "Preparation-Establishment-Expression" dynamic and identifies the mechanistic root of OOC behavior not as knowledge erasure but as generic prior dominance. Specifically, we find that intrinsic assistant priors suppress emergent persona intents during the critical "Establishment" phase. Guided by this diagnosis, we propose surgically recalibrating the signal magnitude in fewer than 5% of attention heads. This targeted intervention effectively counteracts prior suppression, significantly restoring character consistency while preserving general reasoning capabilities.
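To make the intervention concrete, the following is a minimal sketch (not the paper's implementation) of what recalibrating the signal magnitude of a small subset of attention heads might look like. The function name, the tensor layout `[batch, n_heads, seq, d_head]`, the chosen head index, and the gain value `alpha` are all illustrative assumptions.

```python
import torch


def recalibrate_heads(head_outputs: torch.Tensor,
                      head_ids: list[int],
                      alpha: float) -> torch.Tensor:
    """Scale the per-head outputs of selected attention heads by alpha.

    head_outputs: [batch, n_heads, seq, d_head] tensor of head outputs
                  before the output projection (assumed layout).
    head_ids:     indices of the heads to recalibrate; in the paper's
                  setting this set covers fewer than 5% of all heads.
    alpha:        gain factor; alpha > 1 amplifies the signal these
                  heads contribute, counteracting prior suppression.
    """
    out = head_outputs.clone()
    out[:, head_ids] = alpha * out[:, head_ids]  # scale only selected heads
    return out


# Toy demonstration: 8 heads, amplify head 3 only.
x = torch.randn(2, 8, 4, 16)
y = recalibrate_heads(x, head_ids=[3], alpha=1.5)
```

In practice such a rescaling would be applied inside the forward pass (e.g. via a forward hook on each attention module) so that every generated token passes through the recalibrated heads; the sketch above only shows the core tensor operation.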