Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
Abstract
Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets, followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens to hundreds of times more than training on VQA datasets. Recently, Contrastive Vision-Language Models (CVLMs) have shown strong generalization across visual tasks and promising potential for quality assessment. In this work, we propose Q-CLIP, the first fully CVLM-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts that guide the CVLM to perceive subtle quality variations. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance. Extensive experiments demonstrate that Q-CLIP achieves excellent performance on several VQA datasets. Code is provided in the supplementary material.
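To make the abstract's architecture concrete, the following is a minimal PyTorch sketch of the overall idea: a frozen CLIP-like backbone, a lightweight bottleneck adapter shared across the visual and textual branches (the only trainable component), and five learnable quality-level prompt embeddings whose similarities to the pooled video feature are mapped to a scalar quality score. All class and variable names (SharedCrossModalAdapter, QCLIPSketch, quality_prompts, level_scores), the bottleneck adapter design, the mean temporal pooling, and the assumption that the backbone exposes an encode_image method are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SharedCrossModalAdapter(nn.Module):
    """Hypothetical lightweight adapter shared by the visual and textual branches:
    down-projection, non-linearity, up-projection, plus a residual connection.
    Only these parameters would be trained."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class QCLIPSketch(nn.Module):
    """Illustrative Q-CLIP-style head: frozen CLIP encoders, one shared adapter,
    and five learnable quality-level prompt embeddings (worst ... best)."""

    def __init__(self, clip_model, embed_dim: int = 512, num_levels: int = 5):
        super().__init__()
        self.clip = clip_model.eval()                 # frozen backbone (assumed CLIP-like)
        for p in self.clip.parameters():
            p.requires_grad = False
        self.adapter = SharedCrossModalAdapter(embed_dim)          # only trainable part
        # Learnable quality-level prompts, one embedding per quality level.
        self.quality_prompts = nn.Parameter(torch.randn(num_levels, embed_dim) * 0.02)
        # Map the quality levels to scalar scores from worst (1) to best (5).
        self.register_buffer("level_scores", torch.linspace(1.0, 5.0, num_levels))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W), T frames sampled from each video
        b, t = frames.shape[:2]
        with torch.no_grad():
            feat = self.clip.encode_image(frames.flatten(0, 1)).float()  # (B*T, D)
        video = self.adapter(feat).view(b, t, -1).mean(dim=1)   # adapt + temporal pooling
        prompts = self.adapter(self.quality_prompts)             # same adapter on the text side
        video = video / video.norm(dim=-1, keepdim=True)
        prompts = prompts / prompts.norm(dim=-1, keepdim=True)
        probs = (100.0 * video @ prompts.t()).softmax(dim=-1)    # distribution over 5 levels
        return probs @ self.level_scores                          # expected quality score, (B,)
```

In this sketch the predicted score is the expectation over the five quality levels, which is one common way to turn level-wise similarities into a continuous quality rating; the paper's actual scoring head, prompt construction, and frame sampling strategy should be taken from the supplementary code.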