A Theory of Contrastive Learning with Natural Images
Abstract
Why does contrastive learning with simple images and augmentations yield useful representations for downstream tasks? We answer this by analytically computing the optimal contrastive learning (CL) weights in simple one-hidden-layer CNNs using only dataset statistics. For a range of basic augmentations and any image dataset with stationary statistics, we prove that such CNNs trained with a contrastive loss learn sinusoidal first-layer filters. With augmentations that combine translation and additive noise, the CNN learns to partially whiten the input and to measure frequency contrast: differences in power between frequencies with the same expected power. The selected frequencies and their weights can be computed from the dataset's expected power spectrum using a simple "waterfilling" algorithm. Experiments on eight image datasets show that CNNs trained with SGD empirically learn the partial whitening and the predicted frequency contrasts, and that the usefulness of the learned representation for recognition depends both on the augmentations and on the mismatch between the training and test power spectra.
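To make the waterfilling step concrete, below is a minimal sketch of the textbook water-filling allocation: given a dataset's expected power spectrum and a total weight budget, it fills the lowest-power frequencies up to a common "water level". The function name `waterfill`, the `budget` parameter, and the use of this classic allocation rule are illustrative assumptions; the paper's exact weighting of frequencies may differ.

```python
import numpy as np

def waterfill(power, budget):
    """Classic water-filling allocation (illustrative sketch).

    Distributes `budget` across frequencies so that power[i] + alloc[i]
    equals a common water level mu wherever alloc[i] > 0; frequencies whose
    expected power already exceeds mu receive zero weight.
    """
    power = np.asarray(power, dtype=float)
    order = np.argsort(power)            # fill lowest-power frequencies first
    p = power[order]
    alloc = np.zeros_like(p)
    for k in range(1, len(p) + 1):
        mu = (budget + p[:k].sum()) / k  # candidate water level over k channels
        if k == len(p) or mu <= p[k]:    # next frequency would stay "dry"
            alloc[:k] = mu - p[:k]
            break
    out = np.zeros_like(alloc)
    out[order] = alloc                   # undo the sort
    return out

# Example: a 1/f-like expected power spectrum over 8 frequencies.
spectrum = 1.0 / np.arange(1, 9)
weights = waterfill(spectrum, budget=1.0)
selected = np.nonzero(weights)[0]        # frequencies that receive weight
```

Under this rule, the selected frequencies are exactly those whose expected power lies below the resulting water level, which is consistent with the abstract's claim that the selection is determined by the dataset's expected power spectrum alone.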