Benchmarking Pluralistic Alignment Through Persona-Conditioned Behavioral Evaluation
Abstract
As language models (LMs) are used increasingly often by individuals, enterprises, and even governments, ensuring that they are appropriately aligned with human values has become increasingly important. The extent to which a particular LM is aligned is often determined by its performance on safety and preparedness benchmarks; for example, a model may be prompted with harmful inputs to see whether it properly refuses to answer them. However, such benchmarks and tests are often curated by and for individuals with similar socioeconomic backgrounds, cultures, and affiliations. Pluralistic alignment refers to the general goal of creating AI models that are aligned with a diverse set of principles and values. In this work, we propose a generalized methodology for testing how pluralistically aligned LMs are, both by adapting an existing benchmark so that it is tested with personas of varying backgrounds and by creating a new benchmark designed to be pluralistic in nature. We find that the responses of contemporary LMs differ significantly across persona backgrounds, with some drifting by almost 12 percent relative to the baseline.