Interactive Person Retrieval via Multi-Turn Multimodal Conversation
Abstract
Traditional text-based person retrieval approaches typically rely on single-shot textual queries, which are often incomplete or vague in real-world scenarios. Recent chat-based person retrieval methods enable iterative query refinement through question-answering interactions between the system and users. However, these methods do not allow users to interact directly with the retrieved candidates during the conversation, making it difficult to refine the retrieval results effectively. To address these limitations, we propose multimodal interactive person retrieval (MInterPR), a new retrieval paradigm that allows users to iteratively refine retrieved candidates by providing feedback on their visual differences from the target person. To support this task, we establish MInterPEDES, a multimodal conversational dataset constructed by augmenting existing question-answering dialogues with synthesized visual feedback. Furthermore, to tackle the challenge of accurate and efficient semantic understanding in multimodal dialogues, we propose a multimodal conversational memory-enhanced framework, MNEMO, which incorporates an atomic turn encoding (ATE) module to model each dialogue turn independently, and a dialogue memory aggregation (DMA) module to capture fine-grained information and cross-turn dependencies. Extensive experiments demonstrate that MNEMO achieves substantial improvements in both retrieval accuracy and generalization ability, highlighting its potential for real-world applications. The code and dataset will be released to facilitate future research.