

Poster in Workshop: DMLR: Data-centric Machine Learning Research

Data Similarity is Not Enough to Explain Language Model Performance

Gregory Yauney · Emily Reif · David Mimno


Abstract:

Large language models achieve high few-shot performance on many, but not all, downstream tasks. The interaction between pretraining and downstream data is commonly assumed to drive this variance: a task whose data is more similar to a model's pretraining dataset is expected to be easier for that model. We test whether general textual similarity measures (embedding-, token-, and model-based) correlate with large language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Surprisingly, we find no correlation between performance and similarity across a range of models and similarity measures.
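The analysis described in the abstract can be illustrated with a minimal sketch, under assumptions not specified above: a corpus-level similarity score is computed between a pretraining corpus and each downstream task (here, cosine similarity of mean embeddings, one plausible embedding-based measure; the paper's actual measures and models are not reproduced), and those scores are then correlated with few-shot accuracy across tasks. All inputs below are synthetic placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

def embedding_similarity(pretrain_embs: np.ndarray, task_embs: np.ndarray) -> float:
    """Cosine similarity between the mean embeddings of two text corpora."""
    p = pretrain_embs.mean(axis=0)
    t = task_embs.mean(axis=0)
    return float(np.dot(p, t) / (np.linalg.norm(p) * np.linalg.norm(t)))

# Hypothetical example: correlate per-task similarity with few-shot accuracy.
rng = np.random.default_rng(0)
similarities, accuracies = [], []
for _ in range(10):  # ten downstream tasks (placeholder data)
    pretrain_embs = rng.normal(size=(100, 32))  # stand-in for pretraining-corpus embeddings
    task_embs = rng.normal(size=(50, 32))       # stand-in for task-dataset embeddings
    similarities.append(embedding_similarity(pretrain_embs, task_embs))
    accuracies.append(rng.uniform(0.2, 0.9))    # stand-in for measured few-shot accuracy

# A rank correlation near zero (with a large p-value) would mirror the paper's null finding.
rho, p_value = spearmanr(similarities, accuracies)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```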
