Poster in Workshop: DMLR Workshop: Data-centric Machine Learning Research

Probing Heterogeneous Pretraining Datasets with Small Curated Datasets

Gregory Yauney · Emily Reif · David Mimno


Abstract:

Language models rely on increasingly large web-scraped datasets for pretraining. The size of these datasets prevents manual curation, and existing automated quality filters are heuristic and limited. Characterizing these datasets is an open problem. We present preliminary work on documenting and visualizing pretraining datasets by mapping their similarity to downstream benchmark datasets, which are often hand-curated and more focused in style and content. We show that this method provides a fine-grained characterization of popular pretraining datasets, complementing existing characterizations used for quality filtering.
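The abstract describes the method only at a high level. As a minimal illustrative sketch of similarity probing in this spirit, the snippet below scores pretraining documents by their mean cosine similarity to a small curated benchmark, using a simple bag-of-words representation. All function names and the toy data are hypothetical; the paper's actual document representation and similarity measure may differ.

```python
from collections import Counter
import math

def bow(text):
    # Lowercase bag-of-words counts as a sparse vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def probe_similarity(pretraining_docs, benchmark_docs):
    # Score each pretraining document by its mean similarity
    # to the curated benchmark documents.
    bench = [bow(d) for d in benchmark_docs]
    return [
        sum(cosine(bow(doc), b) for b in bench) / len(bench)
        for doc in pretraining_docs
    ]

# Toy usage: a science-QA-style benchmark vs. mixed web text.
benchmark = ["what is the boiling point of water",
             "how do plants convert sunlight"]
web_docs = ["plants use sunlight in photosynthesis",
            "click here for the best deals"]
scores = probe_similarity(web_docs, benchmark)
```

In this toy example the science-related document receives a higher score than the promotional one, which is the kind of fine-grained signal such a probe could surface across a pretraining corpus.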
