A Graph Foundation Model with Cross-Modal Alignment and Modality-Aware Expert Fusion for Multi-Modal Graphs
Abstract
Graph Foundation Models (GFMs) aim to learn universal patterns through large-scale pretraining on diverse graphs and to generalize to open-world scenarios. While GFMs have garnered significant attention, existing works primarily focus on single-modal graphs. However, many real-world graphs are multimodal, comprising graph structures alongside diverse features derived from modalities such as text and images. To date, exploration of Multimodal Graph Foundation Models (MGFMs) remains limited. Incorporating multimodal data provides a more comprehensive view, allowing models to learn richer semantics and thereby advancing GFMs. We are therefore motivated to explore MGFMs, where the core challenge lies in synergistically encoding structures and multimodal features to achieve effective cross-modal alignment and fusion. To this end, we propose CAME, a graph foundation model with Cross-modal Alignment and Modality-aware Expert fusion. Specifically, CAME first generates graph embeddings for each individual modality. We then introduce a multimodal multi-expert encoding mechanism, which includes a dimension-wise routing strategy to fuse multimodal information. Finally, we train CAME with a cross-modal contrastive loss, enabling adaptive alignment and fusion across modalities. Extensive experiments demonstrate the effectiveness of CAME across multiple tasks and diverse multimodal graph datasets.
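To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the three components the abstract names: per-modality node embeddings, a multi-expert fusion layer with dimension-wise routing, and a symmetric cross-modal contrastive loss. All module names, dimensions, and the exact routing and fusion equations are illustrative assumptions rather than the paper's actual implementation.

```python
# Hypothetical sketch of the CAME pipeline; the specific layer shapes and
# routing formulation below are assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DimensionWiseExpertFusion(nn.Module):
    """Fuses per-modality embeddings with a set of experts whose weights are
    assigned per embedding dimension by a learned router (assumed design)."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        # Router maps the concatenated modality embeddings to one weight
        # per (dimension, expert) pair.
        self.router = nn.Linear(2 * dim, dim * num_experts)
        self.num_experts = num_experts
        self.dim = dim

    def forward(self, z_text: torch.Tensor, z_image: torch.Tensor) -> torch.Tensor:
        z = z_text + z_image                                   # simple pre-fusion (assumption)
        gate_logits = self.router(torch.cat([z_text, z_image], dim=-1))
        gates = gate_logits.view(-1, self.dim, self.num_experts).softmax(dim=-1)
        expert_out = torch.stack([e(z) for e in self.experts], dim=-1)  # [N, dim, E]
        return (gates * expert_out).sum(dim=-1)                # dimension-wise mixture


def cross_modal_contrastive_loss(z_text, z_image, temperature: float = 0.1):
    """Symmetric InfoNCE-style loss that aligns the two modality embeddings of
    the same node (positive pair) against other nodes in the batch (negatives)."""
    z_text = F.normalize(z_text, dim=-1)
    z_image = F.normalize(z_image, dim=-1)
    logits = z_text @ z_image.t() / temperature
    labels = torch.arange(z_text.size(0), device=z_text.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    n, d = 32, 128                                             # toy batch of node embeddings
    z_text, z_image = torch.randn(n, d), torch.randn(n, d)     # stand-ins for per-modality graph encoders
    fusion = DimensionWiseExpertFusion(dim=d)
    fused = fusion(z_text, z_image)
    loss = cross_modal_contrastive_loss(z_text, z_image)
    print(fused.shape, loss.item())
```

In this sketch the contrastive loss aligns the modality-specific embeddings while the router lets each embedding dimension draw on a different mixture of experts, which is one plausible reading of "modality-aware expert fusion with dimension-wise routing"; the paper's actual encoders and objectives may differ.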