Spotlight in Workshop: Accessible and Efficient Foundation Models for Biological Discovery
One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data
Michal Golovanevsky · Eva Schiller · Akira Nair · Ritambhara Singh · Carsten Eickhoff
Keywords: [ Multimodal Learning ] [ Clinical Decision Support ] [ Deep Learning ] [ Biomedical Data ] [ Scalability ]
Multimodal models have become increasingly important as they surpass single-modality approaches on diverse tasks ranging from question answering to autonomous driving. Despite the importance of multimodal learning, existing efforts focus on vision-language applications, where the number of modalities rarely exceeds four (images, text, audio, video). However, data in other domains, such as healthcare, may include many more modalities, including X-rays, PET scans, MRIs, genetic screening, genomic data, and clinical notes, creating a need for both efficient and accurate data integration. Many multimodal foundation models rely on cross-attention or self-attention for effective data integration, mechanisms that do not scale well to applications with more than two modalities. The per-layer complexity of computing attention in either paradigm is, at best, quadratic in the number of modalities, posing a computational bottleneck that impedes broad adoption. To address this, we propose a new attention mechanism, One-Versus-Others (OvO) attention, that scales linearly with the number of modalities, thus offering a significant reduction in computational complexity compared to existing multimodal attention methods. Using three biomedical datasets with diverse modalities, we show that our method decreases computation costs while increasing performance compared to popular integration techniques. Across all datasets, OvO reduced the number of required floating-point operations (FLOPs) by at least 91.98%, demonstrating its significant impact on efficiency and enabling wider adoption.
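For intuition, the sketch below illustrates a one-versus-others style layer in PyTorch. It is not the authors' exact formulation: the module name OvOSketch, the single shared bilinear weight W, and the mean-pooled "others" context are illustrative assumptions. The point it demonstrates is the scaling argument from the abstract: each modality requires only one attention computation against an aggregate of the remaining modalities, so the number of attention operations grows linearly with the number of modalities, instead of quadratically as with pairwise cross-attention.

```python
# Illustrative sketch only: a simplified one-versus-others attention layer,
# not the exact OvO formulation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OvOSketch(nn.Module):
    """For each modality, the mean of the *other* modalities acts as a query
    that attends over that modality's tokens, yielding one fused vector per
    modality. Only n attention calls are needed for n modalities (O(n))."""

    def __init__(self, dim: int):
        super().__init__()
        # Assumption: one bilinear weight matrix shared across all modalities.
        self.W = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, modalities: list[torch.Tensor]) -> list[torch.Tensor]:
        # Each modality tensor: (batch, tokens_i, dim); token counts may differ.
        n = len(modalities)
        pooled = torch.stack([m.mean(dim=1) for m in modalities], dim=1)  # (B, n, D)
        total = pooled.sum(dim=1)                                         # (B, D)
        fused = []
        for i, m in enumerate(modalities):
            # Aggregate of the n-1 "other" modalities.
            others = (total - pooled[:, i]) / (n - 1)          # (B, D)
            query = others @ self.W                            # (B, D)
            scores = torch.einsum("bd,btd->bt", query, m)      # one score per token
            attn = F.softmax(scores, dim=-1)                   # attention over modality i's tokens
            fused.append(torch.einsum("bt,btd->bd", attn, m))  # (B, D) fused representation
        return fused


if __name__ == "__main__":
    # Five toy "modalities" (e.g., imaging, genomics, clinical notes, ...)
    # with different token counts but a shared embedding dimension.
    mods = [torch.randn(2, t, 64) for t in (16, 8, 32, 4, 12)]
    layer = OvOSketch(dim=64)
    out = layer(mods)
    print([o.shape for o in out])  # five tensors of shape (2, 64)
```

Under these assumptions, adding a sixth modality adds exactly one more attention call, whereas pairwise cross-attention between all modalities would require on the order of n^2 calls.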