Over recent years, practitioners have poured increasing amounts of compute and data into training large language models (LLMs), usually via one-pass learning on randomly selected tokens from large-scale web corpora. While training on ever-larger portions of web scrapes leads to consistent performance improvements, there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection via pre-trained model embeddings, applied on top of de-duplicated data (a method we call D4), can speed up training (20% efficiency gains) and improve downstream accuracy in LLMs (up to 2%). Furthermore, we show that repeating data intelligently selected by D4 consistently outperforms baseline training, while repeating randomly selected data performs worse than baseline training. This calls into question the common practitioner intuition that randomly selecting new data is optimal for LLM pre-training. We hope our results motivate the community to rethink current standards in data selection for LLM pre-training.
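The abstract describes selecting pre-training data in embedding space: first removing near-duplicate documents, then favoring diverse examples. The sketch below is an illustrative, simplified version of that idea, not the paper's actual D4 pipeline; the function name, thresholds, and the centroid-distance diversity criterion are assumptions for demonstration.

```python
import numpy as np

def select_diverse(embeddings, dup_threshold=0.95, keep_fraction=0.5):
    """Hypothetical sketch of embedding-based data selection:
    (1) drop near-duplicate documents (cosine similarity above a threshold),
    (2) keep the remaining documents farthest from the centroid,
    as a crude proxy for diversity. Not the actual D4 algorithm."""
    # Normalize rows so dot products equal cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = len(X)
    keep = np.ones(n, dtype=bool)
    sim = X @ X.T
    # Greedy de-duplication: for each kept document, drop any
    # later document that is a near-duplicate of it.
    for i in range(n):
        if not keep[i]:
            continue
        dups = np.where(sim[i] > dup_threshold)[0]
        keep[dups[dups > i]] = False
    idx = np.where(keep)[0]
    # Diversification: rank surviving documents by distance to the
    # centroid and keep the farthest (least prototypical) fraction.
    centroid = X[idx].mean(axis=0)
    dist = np.linalg.norm(X[idx] - centroid, axis=1)
    n_keep = max(1, int(keep_fraction * len(idx)))
    order = np.argsort(-dist)  # farthest first
    return idx[order[:n_keep]]
```

In this toy version, exact or near-duplicate rows share cosine similarity near 1.0 and all but the first are discarded before the diversity ranking is applied; a production pipeline would cluster embeddings first rather than compute a full pairwise similarity matrix.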
Author Information
Kushal Tirumala (FAIR/California Institute of Technology)
Daniel Simig (Facebook)
Armen Aghajanyan (FAIR)
Ari Morcos (FAIR, Meta AI)
More from the Same Authors
- 2023: SemDeDup: Data-efficient learning at web-scale through semantic deduplication »
  Amro Abbas · Daniel Simig · Surya Ganguli · Ari Morcos · Kushal Tirumala
- 2022 Poster: Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time »
  Mitchell Wortsman · Gabriel Ilharco · Samir Gadre · Becca Roelofs · Raphael Gontijo Lopes · Ari Morcos · Hongseok Namkoong · Ali Farhadi · Yair Carmon · Simon Kornblith · Ludwig Schmidt
- 2022 Poster: Investigating Generalization by Controlling Normalized Margin »
  Alexander Farhang · Jeremy Bernstein · Kushal Tirumala · Yang Liu · Yisong Yue
- 2022 Spotlight: Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time »
  Mitchell Wortsman · Gabriel Ilharco · Samir Gadre · Becca Roelofs · Raphael Gontijo Lopes · Ari Morcos · Hongseok Namkoong · Ali Farhadi · Yair Carmon · Simon Kornblith · Ludwig Schmidt
- 2022 Spotlight: Investigating Generalization by Controlling Normalized Margin »
  Alexander Farhang · Jeremy Bernstein · Kushal Tirumala · Yang Liu · Yisong Yue
- 2022 Poster: COAT: Measuring Object Compositionality in Emergent Representations »
  Sirui Xie · Ari Morcos · Song-Chun Zhu · Shanmukha Ramakrishna Vedantam
- 2022 Spotlight: COAT: Measuring Object Compositionality in Emergent Representations »
  Sirui Xie · Ari Morcos · Song-Chun Zhu · Shanmukha Ramakrishna Vedantam
- 2021 Poster: CURI: A Benchmark for Productive Concept Learning Under Uncertainty »
  Shanmukha Ramakrishna Vedantam · Arthur Szlam · Maximilian Nickel · Ari Morcos · Brenden Lake
- 2021 Spotlight: CURI: A Benchmark for Productive Concept Learning Under Uncertainty »
  Shanmukha Ramakrishna Vedantam · Arthur Szlam · Maximilian Nickel · Ari Morcos · Brenden Lake
- 2021 Poster: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases »
  Stéphane d'Ascoli · Hugo Touvron · Matthew Leavitt · Ari Morcos · Giulio Biroli · Levent Sagun
- 2021 Spotlight: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases »
  Stéphane d'Ascoli · Hugo Touvron · Matthew Leavitt · Ari Morcos · Giulio Biroli · Levent Sagun
- 2019 Workshop: Identifying and Understanding Deep Learning Phenomena »
  Hanie Sedghi · Samy Bengio · Kenji Hata · Aleksander Madry · Ari Morcos · Behnam Neyshabur · Maithra Raghu · Ali Rahimi · Ludwig Schmidt · Ying Xiao
- 2018 Poster: Measuring abstract reasoning in neural networks »
  Adam Santoro · Felix Hill · David GT Barrett · Ari S Morcos · Timothy Lillicrap
- 2018 Oral: Measuring abstract reasoning in neural networks »
  Adam Santoro · Felix Hill · David GT Barrett · Ari S Morcos · Timothy Lillicrap