Scalable Medical Multimodal Fusion via Symmetric Consistency Modeling
Abstract
Medical diagnosis tasks often rely on heterogeneous information from multiple sources, such as medical images and clinical text. Multimodal fusion is therefore essential for improving classification performance and robustness. However, most existing methods assume a fixed and known modality set, making them less effective when the number or composition of modalities changes. To address this limitation, we propose a modality-agnostic medical multimodal fusion framework that naturally accommodates an arbitrary number of input modalities. At the coarse-grained modality level, we represent each modality’s estimate of the latent semantics as an uncertainty-aware probability distribution and impose symmetric consistency constraints to achieve global cross-modal semantic alignment. At the fine-grained token level, we further design a consistency constraint based on linear reconstruction, which enables structured mutual verification of local semantics across modalities. Finally, for multimodal fusion, we adopt a multi-view consistency strategy to obtain a unified representation for diagnosis prediction. Specifically, each modality is treated in turn as a conditional view to suppress noise in the remaining modalities and extract shared semantics. Extensive experiments on five public and self-constructed multimodal medical datasets demonstrate the effectiveness and scalability of the proposed approach. Code is available at https://github.com/gjhgjbkg/SMMF.
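To make the distribution-level symmetric consistency idea concrete, the following is a minimal, hedged sketch (not the authors' released implementation; all function and variable names are hypothetical). It assumes each available modality encodes its input as a diagonal Gaussian over a shared latent space and penalizes the symmetric KL divergence averaged over all modality pairs, which keeps the objective well defined for any number of modalities.

```python
# Illustrative sketch of a distribution-level symmetric consistency loss.
# Assumption: each modality yields a diagonal Gaussian (mu, logvar) over a
# shared latent space; the loss averages symmetric KL over all modality pairs.
from itertools import combinations
import torch


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians."""
    var_p, var_q = logvar_p.exp(), logvar_q.exp()
    kl = 0.5 * (logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    return kl.sum(dim=-1)  # sum over latent dimensions


def symmetric_consistency_loss(dists):
    """dists: list of (mu, logvar) tensors, one pair per available modality.

    Handles any number of modalities, in line with the modality-agnostic setup.
    """
    losses = []
    for (mu_i, lv_i), (mu_j, lv_j) in combinations(dists, 2):
        sym_kl = (gaussian_kl(mu_i, lv_i, mu_j, lv_j)
                  + gaussian_kl(mu_j, lv_j, mu_i, lv_i))
        losses.append(sym_kl.mean())
    return torch.stack(losses).mean()


# Usage example: three modalities, batch size 8, 64-dimensional latent space.
dists = [(torch.randn(8, 64), torch.randn(8, 64)) for _ in range(3)]
loss = symmetric_consistency_loss(dists)
```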