Geometry-based Schrödinger Bridges for Trustworthy Multimodal Fusion
Abstract
Real-world multimodal systems must be robust to low-quality data such as sensor noise, incomplete inputs, and conflicting modalities. However, existing trustworthy fusion methods rely on the model's own prediction confidence to judge data quality. This creates a circular dependency: when a model is confident but wrong (overconfident), these methods fail to detect the error. To break this loop, we propose Geometry-based Multimodal Fusion (GMF). Instead of relying on predictions, we evaluate reliability by measuring the effort required to transport input data back onto the valid data manifold. We implement this using Diffusion Schrödinger Bridges with Rectified Flow, which allows us to compute Transport Energy as a direct metric of input quality. The logic is simple: valid data already lies on the manifold and requires little energy, while noisy, incomplete, or conflicting data requires high energy to be restored. Because this geometric metric does not depend on the classifier's output, it acts as an independent judge and flags unreliable inputs even when the classifier is fooled. Extensive experiments demonstrate that GMF significantly improves robustness against severe sensor noise and semantic conflicts compared to confidence-based baselines.
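To make the notion of Transport Energy concrete, the following toy sketch may help. It is not the GMF implementation: it assumes a 2-D unit circle as the "valid data manifold", replaces the learned rectified-flow velocity field with a hand-written straight-line field, and takes the energy to be the kinetic energy of the path, i.e. the integral of the squared velocity norm over time. The function names (`project_to_manifold`, `velocity`, `transport_energy`, `reliability_weights`) and the softmax weighting are all hypothetical choices made for illustration.

```python
import math

def project_to_manifold(x):
    """Toy 'valid data manifold' (an assumption): the unit circle/sphere."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]

def velocity(x, t):
    """Hypothetical stand-in for a learned rectified-flow velocity field.

    It moves x along a straight line so that it reaches the manifold at t = 1.
    """
    target = project_to_manifold(x)
    return [(pi - xi) / (1.0 - t) for pi, xi in zip(target, x)]

def transport_energy(x0, n_steps=50):
    """Euler-integrate the flow and accumulate the integral of ||v||^2 dt."""
    x = list(x0)
    dt = 1.0 / n_steps
    energy = 0.0
    for k in range(n_steps):
        v = velocity(x, k * dt)
        energy += sum(vi * vi for vi in v) * dt
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return energy

def reliability_weights(energies, temperature=1.0):
    """Hypothetical fusion weights: softmax over negative energies."""
    scores = [-e / temperature for e in energies]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

clean = [0.6, 0.8]   # already on the unit circle
noisy = [1.8, 2.4]   # same direction, but far off the manifold
energies = [transport_energy(clean), transport_energy(noisy)]
weights = reliability_weights(energies)
```

In this toy setting the on-manifold input has essentially zero energy, while the off-manifold input has an energy equal to its squared distance to the circle, so it receives the smaller weight. No classifier output is consulted at any point, which is the property the abstract relies on.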