FairSSL: Fair Multimodal Self-Supervised Learning
Abstract
Multimodal Self-Supervised Learning (SSL) has achieved remarkable success by learning representations from multiple views of data. However, prevalent methods rely on the redundancy assumption: that different views share substantial task-relevant information. We argue that this assumption fails in complex, real-world settings characterized by heterogeneity (e.g., variable-length healthcare or behavioral data), where enforcing strict alignment can discard unique, modality-specific signals and inadvertently amplify bias. In this work, we propose FairSSL, a framework that treats data heterogeneity as a resource for fairness rather than as a hindrance. Unlike standard contrastive approaches, FairSSL uses a subject-aware Variance-Invariance-Covariance Regularization objective in which alignment is enforced across segments drawn from the same subject. We introduce a segment-based pooling strategy to handle variable-length modalities, and we regularize representations to encourage (i) sufficient within-subject variability, (ii) cross-modal and cross-subject invariance, and (iii) decorrelation of representation dimensions. Theoretical analysis shows that our objective bounds the score gap between protected groups. Empirically, FairSSL significantly outperforms existing baselines on heterogeneous multimodal datasets, improving fairness without sacrificing downstream predictive performance.
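To make the three-term objective concrete, the following is a minimal sketch of a VICReg-style loss with subject-aware pairing, written in PyTorch. The function name fairssl_loss, the coefficient values, and the row-wise same-subject pairing convention are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of a VICReg-style loss with subject-aware pairing.
# Assumption: row i of z_a and row i of z_b are embeddings of two
# segments (or two modalities) drawn from the same subject, produced
# by a segment-based pooling step upstream. Names and coefficients
# are hypothetical.
import torch
import torch.nn.functional as F


def fairssl_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                 lambda_inv: float = 25.0,
                 lambda_var: float = 25.0,
                 lambda_cov: float = 1.0) -> torch.Tensor:
    """z_a, z_b: (batch, dim) paired same-subject segment embeddings."""
    n, d = z_a.shape

    # Invariance: pull paired (same-subject) embeddings together.
    inv_loss = F.mse_loss(z_a, z_b)

    # Variance: keep each embedding dimension spread out across the
    # batch, hinged at a target standard deviation of 1, to prevent
    # representational collapse.
    def variance_term(z: torch.Tensor) -> torch.Tensor:
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return torch.mean(F.relu(1.0 - std))

    var_loss = variance_term(z_a) + variance_term(z_b)

    # Covariance: decorrelate embedding dimensions by penalizing
    # off-diagonal entries of the covariance matrix.
    def covariance_term(z: torch.Tensor) -> torch.Tensor:
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / d

    cov_loss = covariance_term(z_a) + covariance_term(z_b)

    return lambda_inv * inv_loss + lambda_var * var_loss + lambda_cov * cov_loss
```

In training, one would presumably average this loss over same-subject segment pairs in each batch; how the pairs are sampled and how the three coefficients are balanced is not specified by the abstract.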