CAMP: Coherent Alignment of Multimodal Prototypes for Explainable Complementary Learning
Alvaro Lopez Pellicer ⋅ Eoin M Kenny ⋅ Simran Lamba ⋅ Shubham Sharma ⋅ Plamen Angelov ⋅ Saumitra Mishra
Abstract
Most multimodal learning assumes redundant views (such as image–caption pairs), yet many applications require combining complementary modalities that provide distinct evidence (such as an X-ray and a medical history). We term this setting *Complementary Multimodal Classification* (CMC). In CMC, existing explainable-by-design methods often force an accuracy–interpretability trade-off because a single shared similarity metric fails under asymmetric, class-conditional evidence. To address this, we propose Coherent Alignment of Multimodal Prototypes (CAMP). CAMP enforces coherent multimodal reasoning by aligning class-wise evidence via optimal transport and imposing geometric constraints to counter modality dominance and representation collapse. We provide theoretical guarantees showing that these mechanisms eliminate such degeneracies without restricting expressivity. Empirically, across 17 public CMC datasets, CAMP matches or exceeds large ($>$100M-parameter) AutoML baselines with fewer than 1M trainable parameters, and, when fine-tuned end-to-end, it achieves state-of-the-art performance. To the best of our knowledge, this work is the first modality-agnostic prototype-learning framework designed for complementary multimodal tasks.
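To make the alignment mechanism concrete, below is a minimal sketch of entropic optimal transport (Sinkhorn iterations) coupling per-class prototype evidence from two modalities. The abstract does not specify CAMP's exact formulation, so every detail here (the names `sinkhorn` and `align_class_evidence`, the uniform marginals, and the squared-Euclidean cost) is an illustrative assumption, not the paper's method.

```python
# Illustrative sketch only: a generic entropic-OT (Sinkhorn) alignment between
# per-class prototypes of two modalities. Function and variable names are
# hypothetical; CAMP's actual objective is defined in the paper, not here.
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT: transport plan coupling two uniform marginals."""
    K = np.exp(-cost / eps)                        # Gibbs kernel
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    u = np.ones_like(a)
    for _ in range(n_iters):                       # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]             # transport plan P

def align_class_evidence(protos_a, protos_b):
    """Couple one class's prototypes across two modalities via OT."""
    # Cost: squared Euclidean distance between prototype embeddings.
    cost = ((protos_a[:, None, :] - protos_b[None, :, :]) ** 2).sum(-1)
    cost = cost / cost.max()                       # normalize for kernel stability
    plan = sinkhorn(cost)
    # An alignment loss can penalize total transport cost, encouraging coherent
    # cross-modal evidence without imposing one shared similarity metric.
    return (plan * cost).sum()

# Toy usage: 4 prototypes per modality, 16-dim embeddings.
rng = np.random.default_rng(0)
loss = align_class_evidence(rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
print(f"OT alignment loss: {loss:.4f}")
```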