Poster in Workshop: Pluralistic Alignment Workshop

The Wedge Questions: Latent Cultural Boundaries in LLMs via Persona Projection Divergence

Yejin Son ⋅ Yongjin Yang ⋅ Ryan Faulkner ⋅ Matt Ratto ⋅ Seungwon Lim ⋅ Youngjae Yu ⋅ Zhijing Jin

Abstract

Large Language Models (LLMs) contain a wealth of cultural data and often function as a bridge into diverse societies, which demands alignment with diverse moral landscapes and social conventions. Prior work has documented how post-training alignment homogenizes expression in a model's generated text. Our study reveals the complementary internal picture: rather than erasing cultural distinctions, alignment appears to sharpen their geometric encoding in the latent space. We term this phenomenon the "Alignment Paradox". We suggest that this sharpening is a natural byproduct of homogenization: to consistently produce neutral outputs, a model must first know where cultural distinctions lie, which entails encoding them more precisely in its internal representations. To surface and quantify this effect, we propose Persona Projection Divergence (PPD), a geometric measure that identifies cultural value boundaries within the latent space, even where alignment-induced homogenization often renders those distinctions invisible to surface-level text metrics. PPD provides a non-redundant diagnostic signal that captures internal value conflicts and latent polarities. By probing these internal axes before they are collapsed into a neutral facade, we establish a scalable diagnostic tool for identifying "Wedge Questions": questions that trigger latent value conflicts otherwise hidden behind homogenized outputs.