Position: Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning
Abstract
Multimodal learning has seen remarkable progress, particularly with large-scale pre-training across various modalities. However, most current approaches are built on the assumption of a deterministic, one-to-one alignment between modalities. This assumption oversimplifies real-world multimodal relationships, which are inherently many-to-many. This many-to-many property, or \emph{multiplicity}, is not a side effect of noise or annotation error, but an inevitable outcome of intra-modal variability, representational asymmetry, and task-dependent ambiguity in multimodal tasks. We argue that multiplicity is a fundamental bottleneck affecting every stage of the multimodal learning pipeline, from dataset construction to model training and evaluation. By formalizing its causes and consequences, we demonstrate how ignoring multiplicity leads to uncertainty in training, unreliable evaluation, and degraded dataset quality. This position paper calls for new research directions in multimodal learning, including multiplicity-aware learning frameworks, dataset construction practices, and evaluation protocols.