Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Van Huy Vo ⋅ Vasil Khalidov ⋅ Timothée Darcet ⋅ Théo Moutakanni ⋅ Nikita Smetanin ⋅ Marc Szafraniec ⋅ Hugo Touvron ⋅ Camille Couprie ⋅ Maxime Oquab ⋅ Armand Joulin ⋅ Herve Jegou ⋅ Patrick Labatut ⋅ Piotr Bojanowski

Abstract

Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of k-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on two different data domains including web-based images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data.
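
The two stages described in the abstract (hierarchical k-means, then hierarchical balanced sampling) can be illustrated with a toy sketch. This is not the authors' implementation: the two-level depth, the farthest-first initialisation, the equal split of the budget across clusters, and every name below (`kmeans`, `hierarchical_balanced_sample`, `k1`, `k2`, `budget`) are assumptions made for illustration, and the 2-D points stand in for real embeddings.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    # Farthest-first initialisation: deterministic, an assumption of this sketch.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assign(points, centroids)

def hierarchical_balanced_sample(points, k1, k2, budget, seed=0):
    """Two-level k-means followed by top-down balanced sampling of point indices."""
    rng = random.Random(seed)
    cents1, lab1 = kmeans(points, k1)   # level 1: cluster the data points
    _, lab2 = kmeans(cents1, k2)        # level 2: cluster the level-1 centroids
    # Tree: level-2 cluster -> level-1 cluster -> indices of member points.
    tree = {}
    for i, c1 in enumerate(lab1):
        tree.setdefault(lab2[c1], {}).setdefault(c1, []).append(i)
    chosen = []
    per_top = budget // len(tree)       # equal share for each level-2 cluster
    for subclusters in tree.values():
        per_sub = max(1, per_top // len(subclusters))  # equal share per level-1 cluster
        for members in subclusters.values():
            chosen.extend(rng.sample(members, min(per_sub, len(members))))
    return chosen

# Toy pool: one dominant concept (1000 points) and one rare concept (20 points).
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
minority = [(rng.gauss(100, 1), rng.gauss(100, 1)) for _ in range(20)]
data = majority + minority

picked = hierarchical_balanced_sample(data, k1=4, k2=2, budget=40)
raw_share = len(minority) / len(data)
curated_share = sum(i >= len(majority) for i in picked) / len(picked)
print(f"rare concept: {raw_share:.1%} of raw pool, {curated_share:.1%} of curated sample")
```

Because the budget is divided per cluster rather than per point, the rare concept takes up a larger share of the curated sample than of the raw pool, which is the balancing effect the abstract refers to.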
