IQA-Spider: Unifying Reasoning, Grounding, and Referring for Multi-Granularity Image Quality Assessment
Abstract
We present IQA-Spider, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring within an LMM-based system for multi-granularity quality understanding. Existing LMM-based IQA methods typically support only partial perception dimensions, \egno, quality description and question answering~(\ieno, reasoning) or pixel-level grounding, largely due to the absence of (i) a unified task-and-data formulation and (ii) effective optimization paradigms for multi-granularity learning. To address these limitations, we formulate a rigorous four-task paradigm covering global and local quality description, pixel-level grounding, and region-level referring. Based on this formulation, we construct a corresponding IQA dataset with a scalable, automatic annotation pipeline, providing a solid foundation for unified multi-granularity learning. To further enable unified perception, we adopt a conflict-free two-stage design that progressively extends textual multi-granularity understanding to pixel-level grounding: (i) the first stage equips the model with fine-grained textual reasoning across multiple IQA tasks, and (ii) the second stage introduces a training-free text-to-point grounding paradigm that bridges textual semantics and pixel-level perception by mapping token logits to spatial coordinates. Building on these efforts, IQA-Spider delivers unified, explainable, multi-granularity image quality assessment. Extensive experiments across multiple benchmarks demonstrate strong performance, validating the effectiveness and versatility of the proposed formulation and framework. Code and datasets will be released upon acceptance.
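The text-to-point grounding step described above could, in spirit, be sketched as a soft-argmax over a patch-level logit map: softmax the logits into a spatial distribution and read off an expected pixel coordinate. This is an illustrative sketch only; the function name `logits_to_point`, the grid shape, and the soft-argmax choice are assumptions for exposition, not the paper's exact mechanism.

```python
import numpy as np

def logits_to_point(logits, image_size):
    """Map an (H, W) patch-level logit map to a pixel coordinate
    via a softmax-weighted centroid (soft-argmax). Hypothetical sketch."""
    h, w = logits.shape
    # Numerically stable softmax over all patch positions.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Grid of (row, col) patch indices.
    ys, xs = np.mgrid[0:h, 0:w]
    # Expected patch position, rescaled to pixel coordinates.
    y = float((probs * ys).sum()) * image_size[0] / h
    x = float((probs * xs).sum()) * image_size[1] / w
    return x, y
```

With a sharply peaked logit map, the returned point collapses to the center of the dominant patch, while flatter maps yield an averaged location, which is one plausible way token logits could be turned into spatial coordinates without any grounding-specific training.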