PixCLIP: Towards Fine-grained Vision-Language Understanding via Any-granularity Pixel-Text Alignment
Abstract
While CLIP has achieved strong performance across vision–language tasks, fine-grained image–text alignment remains challenging. Recent efforts improve textual granularity by leveraging long, detailed descriptions and replacing CLIP’s text encoder with an LLM, but they often overlook the visual-side bottleneck: achieving finer alignment requires region- and pixel-level visual grounding, not just finer text. To address this issue, we propose PixCLIP, a framework that jointly enhances both sides by accommodating visual prompt regions and long-form text within a unified training objective. First, to support training at this granularity, we develop an automated annotation pipeline that produces long-form descriptions with pixel-level localization, and use it to construct LongGRIT, a high-quality dataset with nearly 1.5M samples. Furthermore, we introduce a three-branch pixel–text alignment framework that aligns image regions with corresponding textual descriptions across multiple granularities. Experiments show that PixCLIP achieves state-of-the-art performance on pixel- and region-level alignment tasks, even with arbitrarily shaped region prompts and long compositional texts, while preserving strong results on standard global image–text retrieval benchmarks.