Poster in Workshop: ICML 2024 Workshop on Foundation Models in the Wild
Generalization vs. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
Antonis Antoniades · Xinyi Wang · Yanai Elazar · Alfonso Amayuelas · Alon Albalak · Kexun Zhang · William Wang
Keywords: [ corpus ] [ NLP ] [ language modeling ] [ memorization ] [ pretraining ] [ generalization ] [ Emergence ]
Despite the proven utility of large language models (LLMs) in real-world applications, how they leverage their large-scale pretraining text corpora to achieve such capabilities remains poorly understood. In this work, we investigate the interplay between generalization and memorization in pretrained LLMs at scale through a comprehensive n-gram analysis of their training data. Our experiments focus on three general task types: translation, question answering, and multiple-choice reasoning. Across various sizes of open-source LLMs and their pretraining corpora, we observe that as model size increases, task-relevant n-gram pair data becomes increasingly important, leading to improved task performance, decreased memorization, stronger generalization, and emergent abilities. Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data, and point the way to larger-scale analyses that could further improve our understanding of these models.
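To make the notion of "task-relevant n-gram pair data" concrete, here is a minimal Python sketch of one way such counts could be gathered: for each task example, it counts pretraining documents in which an n-gram from the task input and an n-gram from the task output co-occur. This is an illustrative assumption about the analysis, not the authors' implementation; all function and variable names (`ngrams`, `count_ngram_pairs`, the toy examples) are hypothetical, and a real analysis over a full pretraining corpus would use indexed lookups rather than this brute-force scan.

```python
from collections import Counter


def ngrams(text: str, n: int) -> set[str]:
    """Return the set of whitespace-token n-grams in `text`."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def count_ngram_pairs(task_examples, corpus_docs, n: int = 3) -> Counter:
    """For each (input, output) task example, count corpus documents that
    contain at least one input-side and one output-side n-gram."""
    counts = Counter()
    for idx, (task_input, task_output) in enumerate(task_examples):
        in_grams = ngrams(task_input, n)
        out_grams = ngrams(task_output, n)
        for doc in corpus_docs:
            doc_grams = ngrams(doc, n)
            # A document counts as task-relevant here only if it pairs an
            # input n-gram with an output n-gram from the same example.
            if doc_grams & in_grams and doc_grams & out_grams:
                counts[idx] += 1
    return counts


# Toy usage: a translation-style example against two tiny "documents".
examples = [("the cat sat on the mat", "le chat est assis sur le tapis")]
docs = [
    "the cat sat on the mat means le chat est assis sur le tapis",
    "an unrelated document about n-gram statistics",
]
print(count_ngram_pairs(examples, docs, n=3))  # Counter({0: 1})
```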