Rethinking AI Alignment: From Static Rewards to Social Reinforcement Learning
Abstract
Despite the widespread adoption of Reinforcement Learning from Human Feedback (RLHF), state-of-the-art AI systems remain prone to two persistent failure modes: hallucination (producing fluent but false content) and moral drift (convergence toward exploitative or harmful equilibria). We argue that these are not distinct phenomena but plausibly share a single underlying cause: feedback collapse. Feedback collapse occurs when complex human values are compressed into fixed scores and frozen offline, decoupling the training signal from the true goals of truth and rightness. Optimizing for these proxies then tends to misalign the learning process under distribution shift. To address this, we propose Social Reinforcement Learning (Social RL) as a promising route to structurally enforcing feedback integrity. By situating agents in social environments driven by peer critique, reputation, observation, and sanction, Social RL treats alignment as an ongoing negotiation rather than a static specification problem, and offers mechanisms for correcting epistemic errors and stabilizing ethical norms in open-ended environments.