Correspondence Cognitive Learning for Multi-Modal Object Re-Identification
Abstract
Multi-modal object Re-Identification (ReID) aims to retrieve the same object across different modalities by exploiting their complementary visual information. Recent advances leverage Multi-modal Large Language Models (MLLMs) to generate descriptive textual annotations as auxiliary supervision. However, existing approaches usually adopt these generated texts directly, overlooking the varying degrees of correspondence between the visual and textual modalities. This neglect leads the model to treat strong- and weak-correspondence image–text pairs equally, limiting its ability to learn discriminative associations and hindering effective optimization. To overcome this limitation, we propose a Correspondence Cognitive Learning (CCL) framework that explicitly models the correspondence degree and enables a progressive learning process from easy to hard pairs. CCL comprises two synergistic modules. The Correspondence-Guided Semantic Refinement (CGSR) module dynamically refines visual representations with text semantics according to the correspondence difficulty estimated in the previous epoch, thereby enhancing feature alignment under imperfect associations. The Cognitive-Driven Dynamic Optimization (CDDO) module introduces a self-paced weighting mechanism that adaptively adjusts the optimization focus, emphasizing easy pairs in the early stage and gradually integrating harder ones as training proceeds. Together, these modules improve feature-level alignment and optimization adaptivity, yielding robust and discriminative multi-modal representations. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the superior performance of our method.
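To make the self-paced weighting idea behind CDDO concrete, the fragment below is a minimal sketch of one possible scheme of the kind the abstract describes, assuming per-pair alignment losses serve as the difficulty signal. The function name `self_paced_weights`, the exponential weighting form, and the linear pace schedule are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def self_paced_weights(pair_losses, epoch, total_epochs, gamma=5.0):
    """Illustrative self-paced weighting over image-text pairs.

    Pairs with low loss (used here as a proxy for strong image-text
    correspondence) receive weights close to 1, while harder pairs are
    down-weighted early in training. The pace parameter grows with the
    epoch so that hard pairs are gradually re-included.
    """
    # Normalize losses to [0, 1] within the batch as a crude difficulty score.
    span = pair_losses.max() - pair_losses.min() + 1e-8
    difficulty = (pair_losses - pair_losses.min()) / span
    # Pace grows linearly from near 0 to 1 across training.
    pace = min(1.0, (epoch + 1) / total_epochs)
    # Soft self-paced weights: easy pairs stay near 1, hard pairs are suppressed early on.
    weights = torch.exp(-gamma * (1.0 - pace) * difficulty)
    return weights.detach()

# Usage: re-weight a per-pair alignment loss before averaging.
pair_losses = torch.tensor([0.2, 0.9, 1.7, 0.4])  # one loss per image-text pair
weights = self_paced_weights(pair_losses, epoch=0, total_epochs=60)
weighted_loss = (weights * pair_losses).sum() / weights.sum()
```

Under this kind of schedule, early epochs are dominated by strongly corresponding pairs, and the contribution of weakly corresponding pairs increases as the pace parameter approaches 1.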