

Poster in Workshop: DMLR Workshop: Data-centric Machine Learning Research

D4: Document Deduplication and Diversification

Kushal Tirumala · Daniel Simig · Armen Aghajanyan · Ari Morcos


Abstract:

Over recent years, practitioners have poured an increasing amount of compute and data into training large language models (LLMs), usually via one-pass learning on randomly selected tokens from large-scale web corpora. While training on ever-larger portions of web scrapes leads to consistent performance improvements, there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve downstream accuracy in LLMs (by up to 2%). Furthermore, we show that repeating data intelligently selected by D4 consistently outperforms baseline training, while repeating randomly selected data performs worse than baseline training. This calls into question the common practitioner intuition that randomly selecting new data is optimal for LLM pre-training. We hope our results motivate the community to rethink current standards in data selection for LLM pre-training.
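To make the idea of data selection via pre-trained model embeddings concrete, below is a minimal sketch of one way such a pipeline could look: cluster document embeddings, drop near-duplicates within each cluster, then prune the most prototypical documents near each centroid to keep a more diverse subset. This is an illustrative assumption-laden sketch, not the authors' exact D4 pipeline; the function name, thresholds, and clustering choices are hypothetical.

```python
# Illustrative sketch of embedding-based data selection (NOT the authors' exact method):
# 1) cluster document embeddings, 2) drop near-duplicates within each cluster,
# 3) prune the most prototypical (centroid-adjacent) documents to diversify the remainder.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def select_documents(embeddings, n_clusters=100, dup_threshold=0.95,
                     keep_fraction=0.8, seed=0):
    """Return indices of documents kept after semantic dedup + diversification."""
    emb = normalize(np.asarray(embeddings))  # unit-norm so dot product = cosine similarity
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init="auto").fit(emb)
    kept = []
    for c in range(n_clusters):
        idx = np.where(kmeans.labels_ == c)[0]
        if len(idx) == 0:
            continue
        # Semantic de-duplication: greedily drop documents too similar to one already kept.
        cluster_emb = emb[idx]
        selected = []
        for i, vec in enumerate(cluster_emb):
            if all(vec @ cluster_emb[j] < dup_threshold for j in selected):
                selected.append(i)
        survivors = idx[selected]
        # Diversification: rank by distance to the centroid and drop the closest
        # (most prototypical) documents, keeping the farthest `keep_fraction`.
        centroid = normalize(kmeans.cluster_centers_[c][None, :])[0]
        dist = 1.0 - emb[survivors] @ centroid
        order = np.argsort(-dist)  # farthest from centroid first
        n_keep = max(1, int(keep_fraction * len(survivors)))
        kept.extend(survivors[order[:n_keep]].tolist())
    return sorted(kept)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(1000, 64))  # stand-in for pre-trained model embeddings
    kept = select_documents(fake_embeddings, n_clusters=10)
    print(f"kept {len(kept)} of 1000 documents")
```

The key design point this sketch captures is that both steps operate purely on embedding geometry: near-duplicate removal uses pairwise cosine similarity, while diversification uses distance to the cluster centroid, so no labels or heuristics over raw text are required.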
