
D4: Document Deduplication and Diversification
Kushal Tirumala · Daniel Simig · Armen Aghajanyan · Ari Morcos

Over recent years, practitioners have poured increasing amounts of compute and data into training large language models (LLMs), usually performing a single pass over randomly selected tokens from large-scale web corpora. While training on ever-larger portions of web scrapes yields consistent performance improvements, little work has explored the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve downstream accuracy in LLMs (by up to 2%). Furthermore, we show that repeating data intelligently selected by our method, D4, consistently outperforms baseline training, while repeating randomly selected data performs worse than baseline training. This calls into question the common practitioner intuition that randomly selecting new data is optimal for LLM pre-training. We hope our results motivate the community to rethink current standards in data selection for LLM pre-training.
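To make the idea of embedding-based deduplication and diversification concrete, here is a minimal sketch in the spirit of the abstract: embed documents, cluster the embeddings, drop near-duplicates within each cluster, then drop the most "prototypical" survivors to diversify what remains. This is an illustration, not the paper's exact algorithm; the function name `d4_style_selection`, the similarity threshold, the keep fraction, and the toy k-means are all hypothetical choices.

```python
import numpy as np

def d4_style_selection(embeddings, n_clusters=4, dup_thresh=0.95,
                       keep_frac=0.8, seed=0):
    """Illustrative embedding-based data selection (hypothetical sketch):
    1) cluster normalized embeddings with a toy k-means;
    2) deduplicate: within each cluster, drop points whose cosine
       similarity to an earlier kept point exceeds dup_thresh;
    3) diversify: among survivors, drop the points most similar to
       their own cluster centroid, keeping the least prototypical."""
    rng = np.random.default_rng(seed)
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Toy k-means on the unit sphere (a few iterations suffice here).
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(10):
        labels = np.argmax(X @ centroids.T, axis=1)
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):
                c = members.mean(axis=0)
                centroids[k] = c / np.linalg.norm(c)
    labels = np.argmax(X @ centroids.T, axis=1)  # final assignment

    keep = np.ones(len(X), dtype=bool)
    # Step 2: within-cluster near-duplicate removal.
    for k in range(n_clusters):
        idx = np.where(labels == k)[0]
        for i, a in enumerate(idx):
            if not keep[a]:
                continue
            sims = X[idx[i + 1:]] @ X[a]
            keep[idx[i + 1:][sims > dup_thresh]] = False

    # Step 3: drop the most prototypical survivors (closest to centroid).
    surv = np.where(keep)[0]
    proto = (X[surv] * centroids[labels[surv]]).sum(axis=1)
    n_keep = int(keep_frac * len(surv))
    return np.sort(surv[np.argsort(proto)[:n_keep]])
```

In practice one would use a real pre-trained text encoder for the embeddings and a scalable clustering library rather than the toy k-means above; the sketch only shows how deduplication and diversification compose as two successive filters on the same embedding space.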

Author Information

Kushal Tirumala (FAIR/California Institute of Technology)
Daniel Simig (Facebook)
Armen Aghajanyan (FAIR)
Ari Morcos (FAIR, Meta AI)
