Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Van Huy Vo · Vasil Khalidov · Timothée Darcet · Théo Moutakanni · Nikita Smetanin · Marc Szafraniec · Hugo Touvron · Camille Couprie · Maxime Oquab · Armand Joulin · Herve Jegou · Patrick Labatut · Piotr Bojanowski
Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of k-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on two different data domains including web-based images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data.
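The two stages described in the abstract, hierarchical k-means followed by hierarchical balanced sampling, can be illustrated with a small sketch. This is not the authors' implementation: it assumes only two levels of clustering, a plain Lloyd's k-means written in NumPy, and an equal split of the sampling budget with no redistribution of unused budget. The function names `kmeans` and `curate` and the parameters `k1`, `k2`, and `n_target` are hypothetical and chosen for illustration.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distances, shape (n_points, k).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


def curate(points, k1, k2, n_target, seed=0):
    """Two-level k-means plus balanced sampling (illustrative assumption).

    Level 1 clusters the data points; level 2 clusters the level-1 centroids.
    The budget n_target is split equally across level-2 clusters, then equally
    across the non-empty level-1 clusters inside each of them.
    Returns the indices of the selected points.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    c1, a1 = kmeans(points, k1, seed=seed)   # points -> level-1 clusters
    _, a2 = kmeans(c1, k2, seed=seed + 1)    # level-1 centroids -> level-2 clusters

    chosen = []
    top_budget = n_target // k2
    for t in range(k2):
        # Non-empty level-1 clusters that belong to level-2 cluster t.
        subs = [c for c in np.flatnonzero(a2 == t) if (a1 == c).any()]
        if not subs:
            continue
        sub_budget = top_budget // len(subs)
        for c in subs:
            members = np.flatnonzero(a1 == c)
            take = min(sub_budget, len(members))
            # Sample without replacement inside the level-1 cluster.
            chosen.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(chosen) if chosen else np.empty(0, dtype=int)
```

For example, `curate(features, k1=1000, k2=50, n_target=100_000)` would return row indices into a hypothetical `features` array of embeddings. Because each cluster receives the same budget regardless of its size, heavily represented concepts are down-sampled relative to rare ones, which is the balancing effect the abstract refers to. In this simplified version a small cluster may not fill its share, so the result can contain fewer than `n_target` points.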