Cross-Modal Semantic Decoupling and Transfer for Text-to-Visible-Infrared Person Re-Identification
Abstract
Text-to-Image Person Re-Identification (TI-ReID) retrieves visible pedestrian images using text queries. However, in low-light or nighttime settings, visible images lack sufficient identity details, whereas infrared images effectively capture pedestrian contours and textures. To enable all-day surveillance, we propose a dual cross-modal retrieval task, Text-to-Visible-Infrared Re-Identification (TVI-ReID), and construct corresponding tri-modal datasets. Compared with TI-ReID, TVI-ReID faces two key challenges: (1) the complex hybrid discrepancies introduced by performing dual cross-modal retrieval across three modalities, and (2) the semantic inconsistency between the pretraining task and the downstream task. To address these issues, we propose a Cross-Modal Semantic Decoupling and Transfer (CSDT) framework. CSDT constructs color-related and color-irrelevant feature subspaces via Semantic Decoupling Learning (SDL) to align the semantics shared by the text and the two image modalities, thereby reducing the hybrid discrepancies. Moreover, Semantic Distribution Transfer (SDT) adapts the pretrained text-to-visible alignment to text-to-infrared matching, alleviating the semantic inconsistency between pretraining and the downstream task. Extensive experiments on the tri-modal datasets show that our approach outperforms existing state-of-the-art TI-ReID methods. The datasets and code will be released publicly.
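To make the decoupling idea concrete, the following is a minimal, assumed sketch (not the authors' released implementation) of projecting a shared embedding into a color-related and a color-irrelevant subspace; the module name, dimensions, and cosine-similarity matching are illustrative assumptions.

```python
# Hypothetical sketch of semantic decoupling into two feature subspaces.
# All names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticDecouplingHead(nn.Module):
    def __init__(self, embed_dim: int = 512, sub_dim: int = 256):
        super().__init__()
        # Two parallel projections: color-related vs. color-irrelevant subspaces.
        self.color_proj = nn.Linear(embed_dim, sub_dim)
        self.shape_proj = nn.Linear(embed_dim, sub_dim)  # color-irrelevant

    def forward(self, feat: torch.Tensor):
        # L2-normalize each subspace embedding for cosine-similarity retrieval.
        color_feat = F.normalize(self.color_proj(feat), dim=-1)
        shape_feat = F.normalize(self.shape_proj(feat), dim=-1)
        return color_feat, shape_feat

if __name__ == "__main__":
    head = SemanticDecouplingHead()
    text_emb = torch.randn(4, 512)  # e.g. text-encoder embeddings
    vis_emb = torch.randn(4, 512)   # visible-image embeddings
    ir_emb = torch.randn(4, 512)    # infrared-image embeddings

    t_color, t_shape = head(text_emb)
    v_color, v_shape = head(vis_emb)
    i_color, i_shape = head(ir_emb)

    # Text-to-visible matching can use both subspaces, while text-to-infrared
    # matching relies mainly on the color-irrelevant subspace, since infrared
    # images carry little reliable color information.
    sim_tv = t_color @ v_color.t() + t_shape @ v_shape.t()
    sim_ti = t_shape @ i_shape.t()
    print(sim_tv.shape, sim_ti.shape)
```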