On the Generalization Gap in Self-Evolving Language Model Reasoning
Abstract
Recent work suggests that LLMs can improve their abilities through \textit{self-evolution}, using only internally generated supervision. A central open question, however, is not whether self-evolution can help, but rather: \textit{how far is it from oracle-supervised training under minimal assumptions?} To address this question, we present a controlled empirical analysis of LLM self-evolution under a strict formulation: self-evolution is allowed access only to (i) an unlabeled prompt set and (ii) a base language model, with all supervision signals generated by this model. Under this formulation, we can evaluate many self-evolution approaches in a unified preference optimization framework. Specifically, we analyze four representative self-evolution methods, ranging from single-round verification to multi-turn feedback, iterative training, and curriculum learning. For our primary analysis, we use a clean setting based on the Knights and Knaves logical reasoning dataset, which provides deterministic solutions, systematic verification, and a hierarchy of difficulty levels that enables an evaluation of easy-to-hard generalization. In this controlled setting, we find that increasingly complex self-evolution strategies yield consistent but limited gains, and a substantial performance gap persists relative to oracle supervision. One strategy stands out as effective: using a larger model (Gemma 12B) with iterative revision based on natural language feedback nearly matches oracle performance. We also study self-evolution on the OpenThoughts reasoning corpus and evaluate on standard problem-solving benchmarks. In this regime, self-evolution leads to only modest improvements, even with more resource-intensive strategies or online RL. Overall, our results shed new light on the empirical limits of various types of self-evolution.