Controllable and explainable personality sliders for LLMs at inference time
Abstract
Aligning Large Language Models (LLMs) with specific personas typically relies on Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF); however, these methods are resource-intensive, requiring expensive data collection and distinct model training for each target personality. In this work, we propose a parameter-efficient framework for continuous, multi-dimensional personality control via inference-time activation steering. Our approach addresses the challenge of combining multiple interventions by iteratively retraining probes on a residual stream already modified by previously applied trait vectors, ensuring that successive interventions remain compatible. Once established, these steering vectors function as modular, reusable primitives; users can instantly synthesize novel, complex personality profiles by simply adjusting steering coefficients (α) without any additional training. To support this, we introduce an automated pipeline that identifies optimal intervention layers via activation separation analysis and calibrates coefficients via hyperparameter optimization to maximize alignment while constraining perplexity. Empirical evaluations validate individual trait shifts using an LLM-as-a-judge framework and demonstrate, via the Big Five inventory, that our method effectively modulates the model's holistic personality profile without updating base model parameters.
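To make the mechanism concrete, the sketch below illustrates one common way such inference-time activation steering can be realized: a forward hook adds a scaled steering vector α·v to the residual stream of a chosen transformer layer during generation. The model name, layer index, and random placeholder vector are assumptions for illustration only, not the configuration or code used in this work.

```python
# Illustrative sketch of inference-time activation steering (not the exact implementation
# from this paper). A steering vector is added to the residual stream of one transformer
# block, scaled by a user-chosen coefficient alpha that acts as a personality "slider".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any decoder-only LM exposes analogous blocks
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6                       # assumed intervention layer (the paper selects it via separation analysis)
hidden_size = model.config.hidden_size
v_trait = torch.randn(hidden_size)  # placeholder direction; in practice, a probe-derived trait vector
v_trait = v_trait / v_trait.norm()  # unit norm so alpha alone controls intervention strength
alpha = 4.0                         # steering coefficient: larger |alpha| -> stronger trait expression

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element holds the residual-stream hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + alpha * v_trait.to(hidden)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
try:
    prompt = "Tell me about your weekend plans."
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so the base model's behavior is fully restored
```

Composing several traits would amount to adding multiple such scaled vectors, each at its calibrated layer and coefficient, which is precisely the setting that motivates the iterative probe retraining described above.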