

Poster

Detecting Any Instruction-to-Answer Interaction Relationship: Universal Instruction-to-Answer Navigator for Med-VQA

Zhongze Wu · Hongyan Xu · Yitian Long · Shan You · Xiu Su · Jun Long · Yueyi Luo · Chang Xu

Hall C 4-9 #2309
Thu 25 Jul 2:30 a.m. PDT — 4 a.m. PDT

Abstract:

Medical Visual Question Answering (Med-VQA) interprets complex medical imagery using user instructions to support precise diagnostics, yet it is hampered by diverse and inadequately annotated images. In this paper, we introduce the Universal Instruction-Vision Navigator (Uni-Med) framework for extracting instruction-to-answer relationships, which facilitates understanding of the visual evidence behind responses. Specifically, we design the Instruct-to-Answer Clues Interpreter (IAI) to generate visual explanations based on the answers and to mark the core parts of instructions with "real intent" labels. The IAI-Med VQA dataset, produced with IAI, is publicly available to advance Med-VQA research. Additionally, our Token-Level Cut-Mix module dynamically aligns visual explanations with image patches, ensuring that answers are traceable and learnable. We also employ intention-guided attention to minimize interference from non-core instruction tokens, sharpening the model's focus on the "real intent". Extensive experiments on the SLAKE dataset show that Uni-Med achieves superior accuracy (87.52% closed-set, 86.12% overall), outperforming MedVInT-PMC-VQA by 1.22% and 0.92%, respectively. Code and dataset are available at: https://github.com/zhongzee/Uni-Med-master.
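To make the intention-guided attention idea concrete, the sketch below shows one plausible way to bias attention toward instruction tokens labeled as "real intent" while downweighting non-core tokens. This is an illustrative approximation only; the class name, the `intent_mask` input, and the additive `intent_bias` penalty are assumptions for the sketch and are not taken from the paper or the released code.

```python
# Minimal sketch (assumed formulation): attention logits over instruction tokens
# are penalized wherever the token is NOT labeled as "real intent", so non-core
# tokens contribute less to the attended representation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IntentionGuidedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, intent_bias: float = 4.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Strength of the additive penalty applied to non-core tokens (hypothetical).
        self.intent_bias = intent_bias

    def forward(self, x: torch.Tensor, intent_mask: torch.Tensor) -> torch.Tensor:
        # x:           (batch, seq_len, dim) instruction token embeddings
        # intent_mask: (batch, seq_len), 1 for "real intent" tokens, 0 otherwise
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (batch, heads, seq, head_dim)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        # Subtract a bias from logits of non-core tokens before the softmax.
        penalty = (1.0 - intent_mask.float())[:, None, None, :] * self.intent_bias
        attn = F.softmax(attn - penalty, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    intent_mask = torch.zeros(2, 16, dtype=torch.long)
    intent_mask[:, 3:7] = 1  # pretend tokens 3..6 carry the real intent
    y = IntentionGuidedAttention(dim=256)(x, intent_mask)
    print(y.shape)  # torch.Size([2, 16, 256])
```

For the authors' actual module definitions and training setup, refer to the linked repository.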
