Pluralistic On-Policy Self-Distillation
Abstract
Language feedback often contains multiple valid, persona-dependent directions for improvement: a critique may ask a response to match the style of a professional advisor, a travel guide, or an artistic critic. This creates a challenge for pluralistic alignment, where distinct persona-specific feedback signals should be preserved rather than collapsed into a single reward or generic target. We propose Multi-Action-Head On-Policy Self-Distillation (MAH-OPSD), which combines persona-specific feedback with dense, token-level on-policy distillation. For each prompt, MAH-OPSD first generates persona-specific rubrics, which elicit more targeted critiques than generic feedback criteria. It then trains multiple persona action heads on a shared backbone: each head generates a response to the same prompt, receives its own rubric-guided critique, and distills from the base model conditioned on that critique, which serves as its teacher. A lightweight router mixes the learned action heads based on the prompt, enabling adaptive response generation at inference time. In a five-persona pilot, MAH-OPSD improves persona alignment over baselines that collapse all feedback into a single objective, showing the benefit of preserving persona-specific feedback pathways rather than merging all feedback into a single policy.
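The data flow described above can be illustrated with a minimal, framework-free sketch. It is not the paper's implementation. It assumes that each persona head is a linear map from a shared hidden state to next-token logits, that the router takes a softmax over a linear score of the same hidden state, that heads are mixed at the level of next-token probabilities, and that each head is trained with a per-token reverse KL against its critique-conditioned teacher. All names, shapes, and numeric values below are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def head_distributions(heads: List[Matrix], hidden: Vector) -> List[Vector]:
    """One next-token distribution per persona head, all reading the same hidden state."""
    return [softmax(matvec(w, hidden)) for w in heads]


def route(router: Matrix, hidden: Vector) -> Vector:
    """Router weights over heads (assumed: softmax of a linear score)."""
    return softmax(matvec(router, hidden))


def mix_heads(weights: Vector, dists: List[Vector]) -> Vector:
    """Convex combination of the head distributions (assumed mixing rule)."""
    vocab = len(dists[0])
    return [sum(w * d[v] for w, d in zip(weights, dists)) for v in range(vocab)]


def reverse_kl(student: Vector, teacher: Vector, eps: float = 1e-12) -> float:
    """KL(student || teacher) for one token position (assumed distillation loss)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student, teacher)
    )


# Toy setup: hidden size 3, vocabulary size 4, two persona heads.
hidden = [0.5, -1.0, 2.0]  # stand-in for the shared backbone's output
heads = [
    [[0.2, 0.1, -0.3], [0.0, 0.4, 0.1], [-0.5, 0.2, 0.3], [0.1, -0.1, 0.0]],
    [[-0.1, 0.3, 0.2], [0.4, 0.0, -0.2], [0.1, 0.1, 0.1], [0.0, -0.3, 0.5]],
]
router = [[0.3, -0.2, 0.1], [-0.1, 0.4, 0.2]]

# Hypothetical teacher distributions: the base model conditioned on each
# head's own rubric-guided critique would supply these in practice.
teachers = [softmax([1.0, 0.0, -1.0, 0.5]), softmax([-0.5, 1.5, 0.0, 0.2])]

head_dists = head_distributions(heads, hidden)

# Training signal: each head is pulled only toward its own teacher, so
# persona-specific feedback is never merged into one target.
losses = [reverse_kl(s, t) for s, t in zip(head_dists, teachers)]

# Inference: the router mixes the heads based on the prompt representation.
weights = route(router, hidden)
mixed = mix_heads(weights, head_dists)

print("per-head distillation losses:", losses)
print("router weights:", weights)
print("mixed next-token distribution:", mixed)
```

The point of the sketch is the separation of pathways: the losses are computed per head against per-head teachers, while the router only enters when the heads are combined for generation. Mixing logits instead of probabilities, or selecting a single head, would be equally consistent with the description above.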