Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
Data pruning and neural scaling laws: fundamental limitations of score-based algorithms
Fadhel Ayed · Soufiane Hayou
Abstract:
Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where at most $30\%$ of the data is kept. This regime has recently attracted considerable interest because of the role of data pruning in improving the so-called neural scaling laws. In this work, we focus on score-based data pruning algorithms and show, theoretically and empirically, why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch'' theorems for data pruning and discuss potential solutions to these limitations. The present document is an extended abstract; the complete paper is published and can be accessed at \url{https://openreview.net/forum?id=iRTL4pDavo}.
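To make the comparison concrete, here is a minimal sketch of the two pruning strategies the abstract contrasts: score-based pruning, which keeps the highest-scoring examples, versus random pruning, which keeps a uniform subset of the same size. The function names and the choice of "keep highest scores" are illustrative assumptions, not the paper's specific algorithms.

```python
import numpy as np

def score_based_prune(scores, keep_fraction):
    """Keep the indices of the highest-scoring examples.

    `scores` is assumed to be a per-example difficulty/importance score
    (illustrative; actual scoring rules vary across pruning methods).
    """
    n_keep = max(1, int(len(scores) * keep_fraction))
    # argsort is ascending, so the last n_keep indices have the highest scores
    return np.argsort(scores)[-n_keep:]

def random_prune(n_examples, keep_fraction, seed=0):
    """Keep a uniformly random subset of the same size (the baseline)."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(n_examples * keep_fraction))
    return rng.choice(n_examples, size=n_keep, replace=False)

# High-compression regime: keep 30% of the data
scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 0.05])
kept_by_score = score_based_prune(scores, 0.3)   # indices of the top-3 scores
kept_at_random = random_prune(len(scores), 0.3)  # 3 random indices
```

In the high compression regime studied in the paper, the randomly chosen subset is the strong baseline that score-based selections struggle to beat.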