IVQA-LD: Inclusive Multimodal Understanding for People with Limb Deficiency
Abstract
People with limb differences often face significant challenges in accessing inclusive AI services, largely due to the lack of structured, high-quality resources centered on disability contexts. In this work, we introduce a limb-deficiency-aware, body-centric learning and evaluation paradigm that involves (i) a large-scale limb-aware vision–language dataset and evaluation benchmark for multimodal reasoning, and (ii) a model adaptation strategy for Vision-Language Models (VLMs) in limb-difference contexts. Specifically, we first collect limb-difference data covering all eight limb-deficiency types across diverse real-world scenarios. The data are systematically organized into 96 limb-affected human action categories and 68 medical-functional classes defined by the World Health Organization (WHO). We then curate an expert-annotated vision–language dataset for limb-aware multimodal understanding, named Inclusive VQA for Limb Deficiency (IVQA-LD). IVQA-LD comprises 80K VQA pairs spanning eight core tasks, including visual grounding, quantitative reasoning, functional semantic classification, and instructional text generation. We benchmark state-of-the-art VLMs on IVQA-LD and find that they consistently struggle across all tasks, exposing substantial gaps in limb-aware perception and reasoning. To address this, we further propose a Body-centric Structure-aware Initialization (BSI) strategy that aligns model representations with limb-specific semantics. With BSI, VLMs fine-tuned on IVQA-LD achieve significant performance improvements across all tasks. We will publicly release the dataset to support future research.