An Information-Theoretic Criterion for Efficient Data Synthesis
Abstract
Synthetic data has become crucial for training large language models, but its effectiveness is highly inconsistent. Here we provide an information-theoretic account of this inconsistency. Synthetic data is effective only when the generation-training loop is information-open: external sources (verifiers, environments, or rubrics) continually inject task-relevant information that is not implied by the model's current probability distribution. When the loop is information-closed, relying mainly on self-generated samples, repeated rounds of generation and training tend to degrade performance. Building on this criterion, we study the factors that determine how efficiently different synthesis methods inject information. We argue that information-efficient methods tend to generalize strongly: they rely on simple, unified signals that ignore nuisance variation across examples, rather than on ever more instance-level labels. Our work thus provides an operational guide for designing and understanding synthetic data methods, echoing Sutton's "bitter lesson": compute-scalable, general methods beat human-built structure in the long run.
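As a minimal formal sketch of the openness criterion (the notation below is illustrative and ours, not established in the abstract): let $\theta_t$ parameterize the model's distribution $p_{\theta_t}$ after round $t$ of the generation-training loop, let $s_t$ be the external signal used to filter or label round-$t$ synthetic data, and let $Y^\star$ denote the target task distribution. The loop is information-open when each round's signal carries task information beyond what the model already encodes:
\[
I\!\left(s_t;\, Y^\star \,\middle|\, \theta_t\right) > 0 .
\]
Conversely, if $s_t$ is produced solely from samples of $p_{\theta_t}$, then $s_t$ is conditionally independent of $Y^\star$ given $\theta_t$, so this conditional mutual information is zero and further rounds of reprocessing cannot add task information; this corresponds to the information-closed regime in which the abstract predicts degradation.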