

Poster

Detecting Any Instruction-to-Answer Interaction Relationship: Universal Instruction-to-Answer Navigator for Med-VQA

Zhongze Wu · Hongyan Xu · Yitian Long · Shan You · Xiu Su · Jun Long · Yueyi Luo · Chang Xu


Abstract:

Medical Visual Question Answering (Med-VQA) interprets complex medical imagery guided by user instructions to support precise diagnosis, yet it is hampered by diverse and inadequately annotated images. In this paper, we introduce the Universal Instruction-Vision Navigator (Uni-Med) framework for extracting instruction-to-answer relationships. Specifically, we design the Instruct-to-Answer Clues Interpreter (IAI) to mark the "real intent" of each instruction and to generate visual explanations grounded in the answers. The IAI-Med VQA dataset, produced with IAI, is publicly released to advance Med-VQA research. In addition, our Token-Level Cut-Mix module dynamically aligns visual explanations with image patches, making answers traceable and learnable, and our intention-guided attention suppresses interference from non-core instruction tokens, sharpening the focus on the "real intent". Extensive experiments on the SLAKE dataset show Uni-Med's superior accuracy (87.52% closed-ended, 86.12% overall), outperforming MedVInT-PMC-VQA by 1.22% and 0.92%, respectively. Code and dataset are available at: https://anonymous.4open.science/r/Uni-Med-9237.
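The abstract describes the Token-Level Cut-Mix module and the intention-guided attention only at a high level. Below is a minimal PyTorch sketch of how such components could look, assuming patch-level cut-mix driven by cosine similarity to visual-explanation tokens and a softmax re-weighting of instruction tokens against an intent vector. The function names (`token_level_cutmix`, `intention_guided_attention`) and all implementation details are illustrative assumptions, not the authors' released code.

```python
# Minimal, self-contained sketch (PyTorch). All names and mechanics here are
# assumptions for illustration; see the linked repository for the real code.
import torch
import torch.nn.functional as F


def token_level_cutmix(patches_a, patches_b, expl_tokens, mix_ratio=0.5):
    """Cut-mix two images at the patch-token level, pasting into image A the
    patches of image B that align best with the visual-explanation tokens
    (assumption: alignment measured by cosine similarity)."""
    # patches_*: (num_patches, dim); expl_tokens: (num_expl, dim)
    sim = F.cosine_similarity(
        patches_b.unsqueeze(1), expl_tokens.unsqueeze(0), dim=-1
    )                                   # (num_patches, num_expl)
    score = sim.max(dim=1).values       # best alignment per patch of B
    k = max(1, int(mix_ratio * patches_a.size(0)))
    idx = score.topk(k).indices         # patches of B most tied to the answer
    mixed = patches_a.clone()
    mixed[idx] = patches_b[idx]         # paste explanation-aligned patches
    return mixed, idx


def intention_guided_attention(instr_tokens, intent_vec, temperature=1.0):
    """Re-weight instruction tokens by their relevance to the 'real intent'
    vector, down-weighting non-core tokens before answer decoding (sketch)."""
    # instr_tokens: (seq_len, dim); intent_vec: (dim,)
    logits = instr_tokens @ intent_vec / temperature   # relevance per token
    weights = torch.softmax(logits, dim=0)             # (seq_len,)
    return weights.unsqueeze(-1) * instr_tokens        # re-weighted tokens


if __name__ == "__main__":
    torch.manual_seed(0)
    dim, n_patches, n_expl, seq_len = 64, 49, 8, 16
    a, b = torch.randn(n_patches, dim), torch.randn(n_patches, dim)
    expl = torch.randn(n_expl, dim)
    mixed, idx = token_level_cutmix(a, b, expl, mix_ratio=0.3)
    print("patches replaced:", idx.tolist())

    instr, intent = torch.randn(seq_len, dim), torch.randn(dim)
    guided = intention_guided_attention(instr, intent)
    print("guided instruction tokens:", tuple(guided.shape))
```

The sketch only conveys the stated idea that answers stay traceable to explanation-aligned image patches and that non-core instruction tokens are suppressed; the actual alignment and attention mechanisms are defined in the paper and repository.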
