OCNR: Stabilizing Self-Play by Mitigating Iteration-Collapse With One-Class Novelty Rewards
Abstract
Training large language models via self-play often suffers from persistent iteration-collapse, in which performance initially improves but then regresses as training iterations accumulate. We trace this phenomenon to cross-iteration degeneration: the task-generation distribution becomes increasingly confined to a narrow subset of familiar (seen) problems, which weakens the effective learning signal and destabilizes training. To address this issue, we propose a plug-in approach that augments existing self-play pipelines with a one-class novelty reward. A Seen Detector, trained on a historical buffer of previously used training problems, identifies in-support instances and penalizes redundant generation by the questioner, thereby steering exploration toward under-explored yet learnable regions. Experimental results show that the proposed method mitigates iteration-collapse during iterative training and yields consistent improvements.
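The novelty-reward mechanism described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's implementation: the abstract does not specify the Seen Detector's architecture, so the `SeenDetector` class below is a toy one-class detector using nearest-neighbor cosine similarity over hashed bag-of-words embeddings, and the class, method, and threshold names are hypothetical.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy hashed bag-of-words embedding (a stand-in for a real encoder)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

class SeenDetector:
    """One-class detector over a buffer of previously used training problems.

    A generated problem whose maximum cosine similarity to the buffer reaches
    the threshold is treated as in-support ("seen") and receives zero novelty
    reward; anything below the threshold is rewarded as novel.
    """
    def __init__(self, threshold: float = 0.9):
        self.buffer: list[np.ndarray] = []  # embeddings of seen problems
        self.threshold = threshold

    def add(self, problem: str) -> None:
        """Record a problem that has been used for training."""
        self.buffer.append(embed(problem))

    def max_similarity(self, problem: str) -> float:
        """Highest cosine similarity between the problem and the buffer."""
        if not self.buffer:
            return 0.0
        q = embed(problem)
        return max(float(q @ b) for b in self.buffer)

    def novelty_reward(self, problem: str) -> float:
        """1.0 for out-of-support (novel) problems, 0.0 for seen ones."""
        return 0.0 if self.max_similarity(problem) >= self.threshold else 1.0
```

In a full pipeline, this reward would be added to the questioner's objective each iteration, and the buffer would be refreshed with the problems actually used for training, so that regenerating in-support problems earns no reward.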