Revealing Differences in Multi-Modal Embeddings via Constrained Kernel Analysis
Abstract
Multi-modal representation models such as CLIP, SigLIP, and their variants are widely used to embed data from multiple modalities in modern learning systems. While these models are commonly evaluated through downstream performance, how their representations structurally differ in grouping data across modalities remains poorly understood. In this work, we aim to identify modality pairs and sample subsets that induce divergent grouping behavior between two candidate embeddings. We propose \emph{Kernel Optimization for Discrepancy Analysis (KODA)}, a kernel method that constructs unified multi-modal kernels via modality-wise kernel multiplication and formulates discrepancy identification as an optimization problem: it seeks components with high coherence under one embedding while constraining coherence under the other. This formulation yields interpretable discrepancy structures associated with specific modality interactions. We establish finite-sample guarantees characterizing the effective reference sample size required for reliable analysis. To enable scalable computation in multi-modal settings, we develop a randomized low-dimensional approximation of joint kernels using random projections, including Random Fourier Features for shift-invariant kernels. Our empirical results indicate that KODA identifies consistent discrepancy structures across modalities.
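To make the abstract's pipeline concrete, the sketch below illustrates one plausible reading of it: per-modality RBF kernels approximated with Random Fourier Features, a joint kernel formed by modality-wise (Hadamard) multiplication, and the constrained coherence objective relaxed to a ridge-regularized generalized eigenvalue problem. This is a minimal illustration under stated assumptions, not the paper's reference implementation; all function names (`rff_features`, `joint_kernel`, `discrepancy_direction`), the RBF choice, and the eigenvalue relaxation are assumptions introduced here.

```python
# Illustrative sketch of a KODA-style analysis; names and the specific
# relaxation of the constrained objective are assumptions, not the paper's code.
import numpy as np

def rff_features(X, n_features=256, gamma=1.0, seed=0):
    """Random Fourier Features approximating a shift-invariant RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def joint_kernel(embeds, n_features=256):
    """Unified multi-modal kernel: elementwise (Hadamard) product of
    per-modality RFF kernel approximations."""
    K = 1.0
    for m, X in enumerate(embeds):
        Z = rff_features(X, n_features=n_features, seed=m)
        K = K * (Z @ Z.T)  # modality-wise kernel multiplication
    return K

def discrepancy_direction(K1, K2, ridge=1e-3):
    """One relaxation of 'high coherence under K1, constrained under K2':
    maximize v^T K1 v / v^T (K2 + ridge*I) v, i.e. the top eigenvector
    of (K2 + ridge*I)^{-1} K1."""
    n = K1.shape[0]
    M = np.linalg.solve(K2 + ridge * np.eye(n), K1)
    vals, vecs = np.linalg.eig(M)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / np.linalg.norm(v)

# Toy usage with synthetic data: two candidate embedding models, each
# producing image and text features for the same n samples.
rng = np.random.default_rng(1)
n = 200
emb_A = [rng.normal(size=(n, 64)), rng.normal(size=(n, 32))]  # model A: image, text
emb_B = [rng.normal(size=(n, 64)), rng.normal(size=(n, 32))]  # model B: image, text
K_A, K_B = joint_kernel(emb_A), joint_kernel(emb_B)
v = discrepancy_direction(K_A, K_B)
top_samples = np.argsort(-np.abs(v))[:10]  # samples driving the discrepancy
```

Under this reading, the entries of `v` with the largest magnitude flag the sample subset whose grouping diverges most between the two embeddings, and the RFF feature dimension `n_features` controls the accuracy-cost trade-off of the low-dimensional joint-kernel approximation.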