Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning
Yang Liu ⋅ Wentao Feng ⋅ Shudong Huang ⋅ Yalan Ye ⋅ Jiancheng Lv
Abstract
Large-scale web-harvested datasets have fueled the progress of cross-modal retrieval but inevitably suffer from \textit{noisy correspondence}, which severely degrades model generalization. Existing methods primarily address this by filtering out noise or seeking a substitute label, yet they predominantly remain bound by a ``Discrete Selection'' paradigm. We argue that relying on a single discrete proxy induces \textit{Single-Point Fragility} and \textit{Discretization Error}. To overcome these limitations, we propose a novel framework, \textbf{Intra-modal Neighbor-based Rectification (IN$^2$R)}, which shifts the paradigm from searching for a substitute to \textit{synthesizing} a reliable supervision target. Leveraging the intrinsic geometric stability of intra-modal data, IN$^2$R employs a \textbf{Graph Refiner} to perform relational reasoning over neighbors retrieved from a dynamic \textbf{Cross-Modal Memory}. Instead of propagating discrete labels, our method synthesizes a continuous, soft prototype that reflects the consensus of the local semantic neighborhood, effectively rectifying inter-modal misalignment. Extensive experiments on Flickr30K, MS-COCO, and CC152K demonstrate that IN$^2$R significantly outperforms state-of-the-art methods.
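The core idea, retrieving a sample's intra-modal neighbors from a memory bank and aggregating them into a continuous soft prototype rather than picking one discrete substitute, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `soft_prototype`, the cosine-similarity retrieval, and the softmax-weighted averaging are all assumptions standing in for the paper's Graph Refiner and memory mechanism.

```python
import numpy as np

def soft_prototype(query, memory, k=5, tau=0.1):
    """Hypothetical sketch: synthesize a soft supervision target as the
    similarity-weighted consensus of a sample's k intra-modal neighbors
    retrieved from a memory bank of embeddings (one row per entry)."""
    # L2-normalize so that dot products are cosine similarities.
    mem = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = mem @ q
    # Retrieve the k most similar intra-modal neighbors.
    idx = np.argsort(-sims)[:k]
    # Softmax weights over neighbor similarities (temperature tau):
    # a continuous consensus, not a single discrete selection.
    w = np.exp(sims[idx] / tau)
    w /= w.sum()
    # Soft prototype: weighted average of neighbor embeddings.
    proto = (w[:, None] * memory[idx]).sum(axis=0)
    return proto / np.linalg.norm(proto)
```

The design choice this sketch mirrors is the paper's contrast with "Discrete Selection": because the target is a weighted blend of several neighbors, no single noisy memory entry can dominate the supervision signal.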