EGGROLL-IPO: Pluralistic Alignment via Decentralised Post-Training with Population Preferences
Alfie Lamerton ⋅ Bidipta Sarkar ⋅ Roberto-Rafael Maura-Rivero ⋅ Jakob Foerster
Abstract
Frontier AI companies align large language models to constitutions and model specs they author themselves, not to the preferences of the populations the models serve. This is an organisational choice rather than a technical necessity, but the dominant alignment pipeline reinforces it: gradient-based reinforcement learning from human feedback (RLHF) requires dense tensor communication between workers, ruling out protocols in which different workers feed heterogeneous preferences into training. EGGROLL recently showed that large-scale evolution strategies require only a scalar fitness per worker, which makes it practical to train a single shared model from compute, queries, and preferences contributed by a population. The remaining question is which loss should drive that protocol: aggregating preferences naively recovers Borda scoring, which violates Condorcet consistency and independence of irrelevant alternatives (IIA). We argue that the loss should target the maximal lottery, the unique probabilistic social choice function satisfying Arrow's axioms, and that the Online Identity Preference Optimisation (IPO) loss does so. We introduce EGGROLL-IPO, which uses the Online IPO loss as the scalar fitness in EGGROLL, and show across three controlled experiments (cyclic preferences, IIA, and sequence-level convergence) that it outperforms a naive $\pm 1$ preference fitness when the optimiser is held fixed at EGGROLL.
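To make the contrast between Borda scoring and the maximal lottery concrete, the following is a minimal sketch on two toy voter profiles. It is not the paper's implementation: the function names (`pairwise_margins`, `borda_scores`, `is_maximal_lottery`) and the profiles are assumptions introduced here for illustration. The check relies on the standard characterisation that a lottery $p$ is maximal exactly when $p^\top M \ge 0$ componentwise, where $M$ is the skew-symmetric matrix of pairwise majority margins.

```python
from fractions import Fraction

# Hypothetical helpers for illustration only; not taken from the paper.

def pairwise_margins(profile, alternatives):
    """Return M[a][b] = share of voters preferring a to b minus share preferring b to a.

    `profile` is a list of (count, ranking) pairs, each ranking ordered best to worst.
    """
    total = sum(count for count, _ in profile)
    margins = {a: {b: Fraction(0) for b in alternatives} for a in alternatives}
    for count, ranking in profile:
        position = {alt: i for i, alt in enumerate(ranking)}
        for a in alternatives:
            for b in alternatives:
                if a != b:
                    sign = 1 if position[a] < position[b] else -1
                    margins[a][b] += Fraction(sign * count, total)
    return margins

def borda_scores(profile, alternatives):
    """Borda count: an alternative ranked i-th (0-based) by a voter earns n - 1 - i points."""
    n = len(alternatives)
    scores = {a: 0 for a in alternatives}
    for count, ranking in profile:
        for i, alt in enumerate(ranking):
            scores[alt] += count * (n - 1 - i)
    return scores

def is_maximal_lottery(lottery, margins):
    """A lottery p is maximal iff its expected margin against every alternative is >= 0."""
    alternatives = list(margins)
    return all(
        sum(lottery.get(a, 0) * margins[a][b] for a in alternatives) >= 0
        for b in alternatives
    )

alternatives = ["A", "B", "C"]

# Profile 1: A beats B and C head-to-head (Condorcet winner), but B has the top Borda score.
split_profile = [(3, ("A", "B", "C")), (2, ("B", "C", "A"))]
split_margins = pairwise_margins(split_profile, alternatives)
split_borda = borda_scores(split_profile, alternatives)

# Profile 2: a Condorcet cycle, A > B > C > A, each by the same majority margin.
cycle_profile = [(1, ("A", "B", "C")), (1, ("B", "C", "A")), (1, ("C", "A", "B"))]
cycle_margins = pairwise_margins(cycle_profile, alternatives)
uniform = {a: Fraction(1, 3) for a in alternatives}

print(max(split_borda, key=split_borda.get))           # Borda picks B
print(is_maximal_lottery({"A": 1}, split_margins))     # the Condorcet winner A is maximal
print(is_maximal_lottery({"B": 1}, split_margins))     # the Borda winner B is not
print(is_maximal_lottery(uniform, cycle_margins))      # uniform mixing is maximal on the cycle
```

In the first profile, Borda selects B even though a majority prefers A to every other alternative, which is the Condorcet-consistency failure the abstract refers to. In the cyclic profile no single alternative passes the check, so only a mixed lottery can be maximal.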