MAGIC: Multi-Granularity Language-Informed Image Clustering
Abstract
Image clustering is a fundamental unsupervised task in computer vision. Recent studies have explored incorporating external linguistic information to facilitate visual feature learning and thereby enhance clustering performance. Nevertheless, these methods typically rely on fixed vocabularies (e.g., WordNet) to generate language counterparts, leading to inter-modal semantic misalignment due to the granularity discrepancy between visual and textual semantics. Moreover, they often overlook intra-modal semantic redundancy caused by task-irrelevant knowledge. To address these challenges, we propose a new Multi-grAnularity lanGuage-informed Image Clustering method, dubbed MAGIC. To reduce semantic misalignment, we first prompt vision-language models to generate multi-granularity language descriptions that capture rich image semantics, which are then integrated for effective multi-modal alignment. To alleviate semantic redundancy, we design modality-specific semantic adapters that adaptively refine and compress the semantically dense features into clustering-friendly representations under task guidance. The refined visual and textual features are then fused into a consensus representation, which acts as a teacher to guide image clustering through a robust contrastive learning framework. Extensive experiments on benchmark datasets demonstrate that MAGIC outperforms state-of-the-art methods.
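To make the pipeline described above concrete, the sketch below gives one possible reading of it: modality-specific adapters refine pre-extracted image and text features, a simple fusion yields the consensus representation, and the consensus cluster assignments serve as a soft teacher for the image-only branch. All module names, dimensions, the additive fusion, and the soft-label loss are illustrative assumptions, not the paper's exact architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticAdapter(nn.Module):
    """Modality-specific adapter: compresses dense features into a
    lower-dimensional, clustering-friendly space (hypothetical design)."""
    def __init__(self, in_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(),
            nn.Linear(in_dim, out_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


class ConsensusClustering(nn.Module):
    """Fuses refined visual and textual features into a consensus
    representation and predicts cluster assignments for both the
    consensus (teacher) and the image-only (student) branches."""
    def __init__(self, feat_dim=512, adapter_dim=128, n_clusters=10):
        super().__init__()
        self.visual_adapter = SemanticAdapter(feat_dim, adapter_dim)
        self.text_adapter = SemanticAdapter(feat_dim, adapter_dim)
        self.cluster_head = nn.Linear(adapter_dim, n_clusters)

    def forward(self, img_feat, txt_feat):
        v = self.visual_adapter(img_feat)       # refined visual features
        t = self.text_adapter(txt_feat)         # refined textual features
        consensus = F.normalize(v + t, dim=-1)  # simple additive fusion (assumption)
        teacher_logits = self.cluster_head(consensus)
        student_logits = self.cluster_head(v)   # image-only prediction
        return teacher_logits, student_logits


def teacher_student_loss(teacher_logits, student_logits, tau=0.5):
    """Consensus-as-teacher objective: align image-only cluster assignments
    with the fused (consensus) assignments via soft-label cross-entropy."""
    with torch.no_grad():
        target = F.softmax(teacher_logits / tau, dim=-1)
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()


# Toy usage with random tensors standing in for pre-extracted image
# and multi-granularity text embeddings.
img_feat = torch.randn(32, 512)
txt_feat = torch.randn(32, 512)
model = ConsensusClustering()
teacher_logits, student_logits = model(img_feat, txt_feat)
loss = teacher_student_loss(teacher_logits, student_logits)
loss.backward()
```

Note that the paper uses the consensus teacher within a robust contrastive learning framework; the plain soft-label distillation above is only a stand-in for that objective.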