Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundations and Algorithms
Abstract
Cross-modal knowledge distillation (CMKD) aims to transfer knowledge from a teacher model in one modality to a student model in another modality. Existing CMKD methods have demonstrated strong empirical performance when paired multimodal data with aligned semantics are available, but such paired data are often costly or infeasible to obtain. This paper studies CMKD in the more challenging and practically relevant setting of unpaired data. We establish a distributional relationship between teacher and student models under cross-modal distillation and characterize two fundamental quantities that underlie effective knowledge transfer: feature alignment and label alignment. These quantities capture the semantic discrepancy between modalities at the level of representation distributions and prediction distributions, respectively. Guided by this theoretical insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples, thereby eliminating the need for data-level pairing. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, with significant improvements over prior work.
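To make the distribution-level alignment idea concrete, the sketch below illustrates one common way such alignment can be measured without paired samples: a squared Maximum Mean Discrepancy (MMD) between unpaired feature batches, and a KL divergence between batch-averaged prediction distributions. The abstract does not specify the paper's actual objectives; the function names, the RBF-kernel MMD, and the averaged-KL label term here are illustrative assumptions, not the proposed method.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Squared MMD with an RBF kernel between two feature batches.

    Compares the *distributions* of X (n, d) and Y (m, d); no
    sample-level pairing between rows of X and Y is required.
    (Illustrative stand-in for a feature-alignment measure.)
    """
    def k(A, B):
        sq = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-sq / (2.0 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

def label_alignment_kl(p_teacher, p_student, eps=1e-8):
    """KL divergence between batch-averaged class distributions.

    p_teacher (n, C) and p_student (m, C) are softmax outputs from
    unpaired batches; averaging over the batch removes the need for
    per-sample correspondence. (Illustrative label-alignment proxy.)
    """
    p = p_teacher.mean(axis=0) + eps
    q = p_student.mean(axis=0) + eps
    return float(np.sum(p * np.log(p / q)))
```

A sample-level distillation loss (e.g., per-example KL) would require each student input to have a matched teacher input; both quantities above depend only on the two batch distributions, which is what makes the unpaired setting tractable.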