The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms
Abstract
Traditional evaluations measure a learning algorithm's final performance on an i.i.d. test set, reducing learning to a single aggregate score. This approach obscures a fundamental question: to what extent does learning from a specific example generalize to others? Such per-sample generalization, akin to learning by analogy in human cognition, captures how far the knowledge extracted from one example can transfer, yet it remains invisible to standard benchmarks. We introduce the Generalization Spectrum, an evaluation framework designed to expose this hidden dimension. For each training example, we construct a controlled suite of test variants ordered by increasing transfer distance: exact recall, implementation transfer across programming languages, context transfer under complete narrative re-framing, category-matched in-domain problems, and an unpaired baseline. By tracking performance across these distances, we reveal not only whether an algorithm learns, but how far that learning extends. We instantiate this framework on competitive programming, using a synthetic generation pipeline seeded with recent problems to mitigate contamination. Across ICL, SFT, RFT, and RL, we find two levers that shape the generalization radius: \textbf{(i) the learning algorithm}, i.e., how to learn from a fixed set of training instances, where RL yields markedly stronger near-transfer than SFT/RFT under matched memorization and learns more transferable structure; and \textbf{(ii) the learning content}, i.e., what extra signal to provide, or how to reformat it, given the same seeds, where abstract ICL demonstrations and on-policy SFT targets yield more reliable transfer than concrete code and off-policy supervision.