Poster in Workshop: DMLR Workshop: Data-centric Machine Learning Research
Early Experiments in Scalable Dataset Selection for Self-Supervised Learning in Geospatial Imagery Models
Muhammed Razzak · Anthony Ortiz · Caleb Robinson
Dataset selection plays a crucial role in large-scale self-supervised geospatial imagery models, particularly with regard to the impact of dataset diversity on model efficacy. This study investigates whether more diverse geospatial imagery datasets improve the downstream task performance of a self-supervised model trained on such data. To this end, we propose a scalable online clustering method for dataset selection that is designed to maximize diversity. Through a series of experiments on BigEarthNet, we demonstrate both the efficacy of our approach for increasing downstream task performance and its ability to significantly enhance dataset diversity. The results reveal substantial improvements in both supervised and self-supervised training performance. Specifically, our findings demonstrate up to a ~5% increase in accuracy for supervised tasks and a ~6% improvement on downstream tasks following self-supervised learning, surpassing traditional dataset selection methods used in the geospatial domain. These early results highlight the practical value of our approach in constructing robust self-supervised datasets from extensive archives of geospatial imagery, thereby unlocking new possibilities for advanced geospatial analysis and applications.
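The abstract does not spell out the clustering algorithm, so the following is only a minimal sketch of the general idea of diversity-oriented selection via online clustering, not the authors' method. It assumes each image has already been mapped to an embedding vector, uses a generic streaming k-means update, and then draws samples round-robin across clusters so that rare clusters are not crowded out. The names `OnlineKMeans` and `select_diverse`, and the parameters `k` and `budget`, are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class OnlineKMeans:
    """Streaming k-means: a single pass, one point at a time (illustrative only)."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # one list of floats per cluster
        self.counts = []     # number of points seen per cluster

    def partial_fit(self, x):
        """Assign x to its nearest centroid, update that centroid, return its index."""
        # Assumption: the first k points seed the centroids.
        if len(self.centroids) < self.k:
            self.centroids.append(list(x))
            self.counts.append(1)
            return len(self.centroids) - 1
        j = min(range(self.k), key=lambda i: sq_dist(self.centroids[i], x))
        self.counts[j] += 1
        lr = 1.0 / self.counts[j]  # running-mean step size
        self.centroids[j] = [c + lr * (v - c) for c, v in zip(self.centroids[j], x)]
        return j


def select_diverse(embeddings, k, budget, seed=0):
    """Return indices of up to `budget` items, drawn round-robin across clusters."""
    rng = random.Random(seed)
    model = OnlineKMeans(k)
    members = defaultdict(list)
    for idx, x in enumerate(embeddings):
        members[model.partial_fit(x)].append(idx)

    pools = list(members.values())
    for pool in pools:
        rng.shuffle(pool)

    selected = []
    # Take one item per cluster in turn until the budget is met or data runs out.
    while len(selected) < budget and any(pools):
        for pool in pools:
            if pool and len(selected) < budget:
                selected.append(pool.pop())
    return selected
```

Because each cluster contributes one sample per round, a cluster holding only a small fraction of the archive can still make up a large share of the selected subset, which is one simple way to trade representativeness for diversity. Memory for the clustering step stays proportional to `k` rather than to the archive size, since only centroids and counts are kept.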