Galileo: Learning Global & Local Features of Many Remote Sensing Modalities
Abstract
We introduce a highly multimodal transformer that analyzes many remote sensing modalities --- multispectral optical, synthetic aperture radar, elevation maps, weather, pseudo-labels, and more --- across space and time. These inputs are useful for diverse remote sensing tasks, e.g., crop mapping and flood detection. However, learning representations of remote sensing data is challenging; for example, objects of interest vary massively in scale, from small vessels (1-2 pixels, transient) to glaciers (thousands of pixels, persistent). We present a novel self-supervised learning algorithm that extracts multi-scale features through masked modeling. Our two-task approach consists of a global and a local training objective that differ in their prediction targets (deep vs. shallow representations) and masking strategies (structured vs. unstructured). With a single pretrained encoder, our Galileo model outperforms SoTA models for satellite images and pixel-time series --- evaluated extensively over eleven benchmarks spanning multiple task types.
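To make the dual-objective idea concrete, the sketch below illustrates one possible training step that combines a global objective (structured masking, deep targets from a separate target encoder) with a local objective (random masking, shallow targets from a linear projection of the inputs). All module names, shapes, masking parameters, and the equal loss weighting are illustrative assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of a dual global/local masked-modeling step (assumptions throughout).
import torch
import torch.nn.functional as F
from torch import nn


class TinyEncoder(nn.Module):
    """Stand-in for the multimodal transformer encoder (hypothetical)."""

    def __init__(self, dim: int = 64, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.blocks(tokens)


def structured_mask(n_tokens: int, block: int = 4) -> torch.Tensor:
    """Mask a contiguous block of tokens (stands in for space/time-structured masking)."""
    mask = torch.zeros(n_tokens, dtype=torch.bool)
    start = torch.randint(0, n_tokens - block, (1,)).item()
    mask[start:start + block] = True
    return mask


def random_mask(n_tokens: int, ratio: float = 0.5) -> torch.Tensor:
    """Mask tokens uniformly at random (unstructured masking)."""
    return torch.rand(n_tokens) < ratio


def dual_objective_step(tokens, encoder, target_encoder, shallow_proj):
    """One training step combining the global and local objectives (sketch)."""
    # Global objective: structured masking, deep targets from a frozen
    # target encoder (assumption: targets are latent features, not raw pixels).
    n = tokens.shape[1]
    g_mask = structured_mask(n)
    with torch.no_grad():
        deep_targets = target_encoder(tokens)
    g_pred = encoder(tokens.masked_fill(g_mask[None, :, None], 0.0))
    global_loss = F.mse_loss(g_pred[:, g_mask], deep_targets[:, g_mask])

    # Local objective: unstructured masking, shallow targets from a simple
    # linear projection of the inputs.
    l_mask = random_mask(n)
    shallow_targets = shallow_proj(tokens)
    l_pred = encoder(tokens.masked_fill(l_mask[None, :, None], 0.0))
    local_loss = F.mse_loss(l_pred[:, l_mask], shallow_targets[:, l_mask])

    return global_loss + local_loss  # equal weighting is an assumption


if __name__ == "__main__":
    dim = 64
    tokens = torch.randn(2, 16, dim)      # (batch, tokens, dim) toy input
    encoder = TinyEncoder(dim)
    target_encoder = TinyEncoder(dim)     # e.g., an EMA copy of the encoder in practice
    shallow_proj = nn.Linear(dim, dim)
    loss = dual_objective_step(tokens, encoder, target_encoder, shallow_proj)
    loss.backward()
    print(float(loss))
```

The intent of the two objectives is complementary: structured masking with deep targets forces the encoder to reason over large, coherent regions (global features), while random masking with shallow targets encourages faithful reconstruction of fine local detail.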