Poster

Contrasting Multiple Representations with the Multi-Marginal Matching Gap

Zoe Piran · Michal Klein · James Thornton · Marco Cuturi


Abstract: Learning meaningful representations of complex objects that can be seen through multiple ($k\geq 3$) views or modalities is a core task in machine learning. Existing methods extend the InfoNCE loss, originally designed for paired views ($k=2$), either by instantiating $\tfrac12k(k-1)$ InfoNCE pairs, or by using reduced embeddings, following a \textit{one vs. average-of-rest} strategy. We propose the multi-marginal matching gap (M3G), a radically different loss that borrows tools from multi-marginal optimal transport theory (MM-OT). Given $n$ points, each seen as a $k$-tuple of embeddings, our loss contrasts the cost of matching these $n\times k$ vectors $k$-tuples at a time with the MM-OT polymatching cost. While the exponential complexity (w.r.t. the number of views $k$) of the MM-OT problem may seem daunting, our experiments show that the multi-marginal Sinkhorn algorithm can easily solve such problems for $k=3$ to $6$ views. Additionally, and thanks to Danskin's theorem, the gradient of the M3G loss can be recovered without running a backward pass. Our experiments demonstrate performance improvements over multiview extensions of InfoNCE, for both self-supervised and multimodal tasks.
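The abstract describes contrasting a ground-truth polymatching cost against an entropic MM-OT cost computed with multi-marginal Sinkhorn. The sketch below is a hypothetical illustration of that idea for $k=3$ views with uniform marginals and a sum-of-pairwise-squared-distances tuple cost; it is not the authors' implementation, and the function names (`m3g_loss`, `pairwise_sq_dists`) and the choices of cost, regularization strength, and iteration count are assumptions made for the example.

```python
import numpy as np
from scipy.special import logsumexp


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)


def m3g_loss(views, eps=0.1, n_iters=100):
    """Illustrative M3G-style loss for k=3 views (assumed setup, not the paper's code).

    Cost of a triple (i, j, l) = sum of pairwise squared distances among the
    three embeddings. The loss contrasts the cost of the ground-truth
    polymatching (tuples (i, i, i)) with the entropic MM-OT polymatching cost
    computed by multi-marginal Sinkhorn in the log domain.
    """
    x, y, z = views
    n = x.shape[0]
    # Cost tensor C[i, j, l] over all n^3 candidate triples.
    C = (pairwise_sq_dists(x, y)[:, :, None]
         + pairwise_sq_dists(x, z)[:, None, :]
         + pairwise_sq_dists(y, z)[None, :, :])
    # Multi-marginal Sinkhorn: one dual potential per view, uniform marginals 1/n.
    f = [np.zeros(n) for _ in range(3)]
    shapes = [(n, 1, 1), (1, n, 1), (1, 1, n)]
    for _ in range(n_iters):
        for a in range(3):
            # Enforce the a-th marginal constraint by a log-sum-exp over the
            # other two axes (this is the exponential-in-k step; cheap for small k).
            others = sum(f[b].reshape(shapes[b]) for b in range(3) if b != a)
            axes = tuple(b for b in range(3) if b != a)
            f[a] = -eps * (np.log(n) + logsumexp((others - C) / eps, axis=axes))
    # Transport plan; by Danskin's theorem, P is also the gradient of the
    # entropic OT cost w.r.t. C, so no backward pass through the loop is needed.
    P = np.exp((sum(f[b].reshape(shapes[b]) for b in range(3)) - C) / eps)
    ot_cost = (P * C).sum()
    gt_cost = C[np.arange(n), np.arange(n), np.arange(n)].mean()
    return gt_cost - ot_cost
```

When the three views are aligned, the diagonal triples are near-optimal and the gap is small; shuffling one view leaves the optimal polymatching cost essentially unchanged while inflating the ground-truth matching cost, so the gap grows, which is the contrastive signal the loss exploits.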
