Pedagogical Games: Paths to Generalisation for Agentic Moral Alignment
Abstract
Can a model learn to be moral by playing games? While existing alignment methods rely predominantly on learned preference signals and opaque moral values, we investigate whether fine-tuning with explicitly defined moral rewards can induce transferable cooperative dispositions in LLM agents. We evaluate generalization along three dimensions: strategic complexity, model capability, and naturalistic complexity. We show that an LLM fine-tuned exclusively on numerical multi-agent games (with no natural-language moral content) reduces harmful actions by up to 35% in semantically unrelated interactive environments. However, this generalization occurs only when training uses iterated public goods games rather than pairwise reciprocity games, and only when environment complexity is matched to model capability. Our results provide evidence that intrinsic moral fine-tuning is a promising direction for LLM alignment, and offer preliminary answers to three questions: which environments work, for which models, and why.