

Poster in Workshop: DMLR Workshop: Data-centric Machine Learning Research

Early Experiments in Scalable Dataset Selection for Self-Supervised Learning in Geospatial Imagery Models

Muhammed Razzak · Anthony Ortiz · Caleb Robinson


Abstract:

Dataset selection plays a crucial role in large-scale self-supervised geospatial imagery models, particularly with regard to the impact of dataset diversity on model efficacy. This study investigates whether more diverse geospatial imagery datasets improve the downstream task performance of a self-supervised model trained on such data. To address this, we propose a scalable online clustering method for dataset selection that is designed to maximize diversity. Through a series of experiments on BigEarthNet, we demonstrate both the efficacy of our approach for increasing downstream task performance and its ability to significantly enhance dataset diversity. The results reveal substantial improvements in both supervised and self-supervised training performance. Specifically, our findings demonstrate up to a ~5% increase in accuracy for supervised tasks and a ~6% improvement on downstream tasks following self-supervised learning, surpassing traditional dataset selection methods used in the geospatial domain. These early results highlight the practical value of our approach in constructing robust self-supervised datasets from extensive archives of geospatial imagery, thereby unlocking new possibilities for advanced geospatial analysis and applications.
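
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of one way diversity-oriented selection with online clustering could look, not the authors' method. It assumes each image has already been mapped to an embedding vector, uses sequential k-means (running-mean centroid updates) as a stand-in clustering step, and keeps at most a fixed number of samples per cluster so that over-represented regions of embedding space do not dominate the selection. The class name `OnlineDiversitySelector` and the parameters `num_clusters` and `per_cluster_budget` are hypothetical.

```python
def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class OnlineDiversitySelector:
    """Hypothetical sketch: sequential k-means plus a per-cluster quota."""

    def __init__(self, num_clusters, per_cluster_budget):
        if num_clusters < 1 or per_cluster_budget < 1:
            raise ValueError("num_clusters and per_cluster_budget must be >= 1")
        self.num_clusters = num_clusters
        self.per_cluster_budget = per_cluster_budget
        self.centroids = []  # running-mean centroid of each cluster
        self.counts = []     # number of embeddings assigned to each cluster
        self.selected = []   # sample indices kept for each cluster

    def observe(self, index, embedding):
        """Process one streamed embedding; return True if the sample is kept."""
        # Assumption: the first `num_clusters` embeddings seed the centroids.
        if len(self.centroids) < self.num_clusters:
            self.centroids.append(list(embedding))
            self.counts.append(1)
            self.selected.append([index])
            return True

        # Assign the embedding to its nearest centroid.
        k = min(
            range(self.num_clusters),
            key=lambda j: _sq_dist(embedding, self.centroids[j]),
        )

        # Update that centroid as a running mean, in constant memory.
        self.counts[k] += 1
        step = 1.0 / self.counts[k]
        self.centroids[k] = [
            c + step * (x - c) for c, x in zip(self.centroids[k], embedding)
        ]

        # Keep the sample only while its cluster is under quota.
        if len(self.selected[k]) < self.per_cluster_budget:
            self.selected[k].append(index)
            return True
        return False

    def selection(self):
        """Return the indices of all kept samples, sorted."""
        return sorted(i for bucket in self.selected for i in bucket)
```

Each embedding is seen once and only the centroids, counts, and kept indices are stored, which is what would let a scheme like this scale to a large archive. Seeding from the first few samples is a simplification; a real pipeline would likely need a more careful initialization.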
