When Softmax Fails at the Top: Extreme‑Value Corrections for InfoNCE
Abstract
Contrastive learning is often trained with the InfoNCE loss, which applies a softmax over similarity scores to push the positive pair above a set of negatives. Beyond its connection to mutual information, this softmax has a precise probabilistic meaning: InfoNCE is the maximum likelihood objective of a Plackett-Luce discrete choice model with Gumbel noise on the scores. We show that this implicit noise model can be systematically wrong in modern settings where similarities are bounded, such as cosine similarities between normalized embeddings. In the bounded regime, the most competitive negatives pile up near the score ceiling, and extreme value theory predicts Weibull rather than Gumbel behavior for these extremes. We confirm this prediction empirically by measuring Weibull-style tail behavior in the hardest negatives throughout InfoNCE training. Motivated by this mismatch, we propose WEINCE, a simple modification of InfoNCE that targets the extreme score regime directly. Across standard benchmarks and backbone architectures, WEINCE improves downstream linear evaluation over InfoNCE with minimal changes to existing training pipelines, showing that modeling the geometry of extremes yields stronger contrastive representations.
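For concreteness, the sketch below is a minimal PyTorch rendering of the standard InfoNCE loss the abstract refers to; the function name info_nce, the temperature tau, and the paired-batch layout are our illustrative assumptions, and no attempt is made to reproduce WEINCE itself, which the abstract does not specify. Comments mark where cosine normalization bounds the scores and where the implicit Gumbel/Plackett-Luce assumption enters.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, tau=0.1):
    """Standard InfoNCE over cosine similarities (reference sketch).

    z_a, z_b: (N, d) batches of paired embeddings; row i of z_a and
    row i of z_b form the positive pair, and all other rows in the
    batch serve as negatives.
    """
    # L2 normalization makes every score a cosine similarity in [-1, 1]
    # (scaled by 1/tau) -- the bounded regime in which the abstract
    # argues the hardest negatives show Weibull rather than Gumbel tails.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / tau  # (N, N) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    # Cross-entropy over the softmax of these logits is maximum
    # likelihood under a Plackett-Luce choice model with Gumbel noise,
    # the implicit assumption the paper questions.
    return F.cross_entropy(logits, labels)
```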