Rethinking Diversity-Preserving RL for Pluralistic Alignment: Empirical Evidence from Rubric-Grounded Moral Reasoning
Abstract
Pluralistic alignment is often assumed to require preserving diverse high-reward responses, especially in moral reasoning, where multiple answers may be defensible under different value systems. This paper tests that assumption in a rubric-grounded reinforcement learning with verifiable rewards (RLVR) setting. Using MoReBench, we compare representative reward-maximizing methods and a distribution-matching baseline under a shared training and evaluation pipeline enabled by a distilled local judge. Across two model families and two moral-reasoning subtasks, reward-maximizing methods match or outperform the distribution-matching baseline. Semantic visualization and qualitative case analysis further suggest that, under current rubric-grounded rewards, high-reward moral-reasoning responses are often more concentrated than the surface pluralism of the task might suggest. These results do not imply that diversity is unimportant in alignment. Rather, they indicate that the need for diversity-preserving RL should be established empirically from the evaluator-induced reward landscape. For pluralistic alignment, this shifts attention from domain-level intuitions alone toward the joint role of benchmark design, reward definition, and optimization objective.