Mission Impossible: Universal Moral Alignment
Abstract
Universal moral alignment for large language models (LLMs) is often framed as the goal of learning a single policy that behaves in accordance with human values. This framing assumes that sufficiently capable models can approximate a coherent and universally valid moral objective. We argue that this assumption is false in pluralistic settings. Drawing on a preference-learning view of alignment and insights from social choice theory, we show that when different groups hold internally coherent but conflicting moral judgments over the same context-action pairs, no non-degenerate single policy can satisfy all groups simultaneously. Under stronger forms of disagreement, aggregation can even produce policies that are misaligned with every group. We outline a constructive agenda that replaces universal moral alignment with pluralistic and procedurally explicit alternatives, including normative governance mechanisms, impossibility-aware evaluation, and richer representations of human preferences that make disagreement visible rather than averaging it away.
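As a minimal illustration of the central claim, the sketch below enumerates every deterministic policy over a toy set of context-action pairs and checks whether any of them is acceptable to two groups at once. Everything in it is an assumption made for illustration and is not taken from the paper's formal setup: the contexts, actions, group names, the +1/-1 acceptability encoding, and the helper names `satisfies`, `policies_satisfying_all`, and `mean_scores`.

```python
from itertools import product

# Hypothetical toy data: two groups judge the same (context, action) pairs.
# +1 means "acceptable", -1 means "unacceptable". Each group is internally
# coherent: it accepts exactly one action per context.
CONTEXTS = ["c1", "c2"]
ACTIONS = ["a", "b"]

GROUP_JUDGMENTS = {
    # The groups disagree on context c1 and agree on context c2.
    "group_1": {("c1", "a"): +1, ("c1", "b"): -1, ("c2", "a"): +1, ("c2", "b"): -1},
    "group_2": {("c1", "a"): -1, ("c1", "b"): +1, ("c2", "a"): +1, ("c2", "b"): -1},
}


def satisfies(policy, judgments):
    """Return True if every action the policy picks is acceptable to the group."""
    return all(judgments[(c, policy[c])] == +1 for c in CONTEXTS)


def policies_satisfying_all():
    """Enumerate deterministic policies and keep those acceptable to every group."""
    kept = []
    for choice in product(ACTIONS, repeat=len(CONTEXTS)):
        policy = dict(zip(CONTEXTS, choice))
        if all(satisfies(policy, j) for j in GROUP_JUDGMENTS.values()):
            kept.append(policy)
    return kept


def mean_scores():
    """Average the group judgments per (context, action) pair."""
    n = len(GROUP_JUDGMENTS)
    return {
        (c, a): sum(j[(c, a)] for j in GROUP_JUDGMENTS.values()) / n
        for c in CONTEXTS
        for a in ACTIONS
    }


if __name__ == "__main__":
    print("Policies acceptable to all groups:", policies_satisfying_all())
    print("Averaged scores:", mean_scores())
```

In this toy setting no deterministic policy passes the check, because the two groups require different actions in context `c1`. The averaged scores for `c1` are zero for both actions, so the average alone does not distinguish a sharp conflict from mutual indifference, which is one simple sense in which averaging can hide disagreement.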