Disentangling Consensus and Value-Specific Representations for Controllable Pluralistic Value Alignment of LLMs
Abstract
With the widespread deployment of large language models (LLMs), aligning model outputs with pluralistic human values has become an important research problem. Recent approaches that train task-specific experts and merge them through parameter aggregation have shown promise for pluralistic alignment. However, these methods often overlook the intrinsic complexity of real-world value data, where multiple correlated value dimensions coexist, resulting in highly similar and entangled expert representations. Consequently, modifying the contribution of one value expert may unintentionally influence other values, limiting fine-grained controllability. To address this issue, we propose DisAlign, a model-merging framework that explicitly decomposes value representations into consensus and value-specific components from an information-geometric perspective. DisAlign first extracts a consensus anchor and subspace to capture shared structure across values, and then applies spectral decomposition to the residual representations to construct disentangled value subspaces. This design enables more precise and independent modulation of multiple values. Experiments on three datasets covering different value frameworks demonstrate that DisAlign consistently improves value disentanglement and achieves more accurate pluralistic value control compared with existing baselines. Our code is available at \url{https://anonymous.4open.science/r/DisAlign-7F35}.
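The consensus/residual decomposition outlined in the abstract can be illustrated with a minimal sketch. This is not the paper's actual algorithm; it assumes each value expert is summarized by a flat parameter vector, uses a simple mean as the consensus anchor, and applies an SVD to the residuals as a stand-in for the spectral decomposition step:

```python
import numpy as np

# Hypothetical illustration (names and steps are assumptions, not DisAlign's
# exact procedure): separate a shared consensus component from
# value-specific residual directions across per-value expert parameters.

rng = np.random.default_rng(0)
d, k = 16, 4                       # parameter dimension, number of value experts
experts = rng.normal(size=(k, d))  # one parameter vector per value expert

# 1. Consensus anchor: shared structure across all value experts
#    (here, simply the mean of the expert parameter vectors).
consensus = experts.mean(axis=0)

# 2. Residuals: what remains after removing the consensus component.
residuals = experts - consensus

# 3. Spectral decomposition of the residual matrix yields orthonormal
#    directions that can serve as disentangled value-specific subspaces.
U, S, Vt = np.linalg.svd(residuals, full_matrices=False)
value_subspaces = Vt               # rows: orthonormal residual directions

# 4. Controllable merge: scale each expert's value-specific contribution
#    independently, then add it back to the shared consensus.
weights = np.array([1.0, 0.5, 0.0, 1.5])
merged = consensus + weights @ residuals
```

Because the residuals are orthogonalized before merging, adjusting one weight changes only that value's contribution, which mirrors the independent-modulation property the abstract claims.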