Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization
Abstract
Language models are pretrained on sequences that blend statistical regularities (structures making text fluent) with factual associations between specific tokens (corresponding to knowledge of facts). While recent work suggests that the variability of their interaction, such as paraphrases of factual associations, critically determines generalization ability, we lack a systematic analysis of these impacts. This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of source-target token pairs, enabling fine-grained control over their interaction. Specifically, the design enables independent control of the nature of diversity, by manipulating stream composition (contextual structure), and of the level of diversity, by varying which statistical streams each fact appears in. Through controlled experiments, we find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its effect on out-of-distribution (OOD) generalization depends critically on contextual structure. In some cases, OOD performance follows the same trend as ID, but in others, diversity becomes essential for non-trivial factual learning. Even when low diversity prevents factual recall, optimal diversity levels depend on training duration. Beyond factual recall failures, we identify structures where statistical generalization fails independently, and others where both capabilities degrade simultaneously. These results demonstrate how the interplay between contextual design and diversity level impacts different aspects of generalization. Furthermore, through a series of controlled interventions on the model components, we trace the generalization failures to distinct optimization bottlenecks, highlighting the importance of the learned embedding and unembedding layers.
Overall, our synthetic framework allows us to isolate effects that would be confounded in large-scale studies, thus offering a controlled testbed for future investigations.