HieRD: Hierarchical Relational Distillation for Vision-Language Embedding Models
Abstract
Knowledge distillation is crucial for compressing large Vision–Language Models (VLMs) into efficient architectures. While prior VLM research has focused primarily on reasoning tasks such as visual question answering, multimodal embedding learning, a key component of large-scale retrieval, has received comparatively little attention. Existing distillation methods typically align static global representations, overlooking hierarchical feature structure and fine-grained cross-modal interactions. This creates a structural gap in which student models fail to inherit object-level semantics and spatial relationships from their teachers. To address this limitation, we propose HieRD, a Hierarchical Relational Distillation framework that preserves hierarchical structure within and across modalities throughout distillation by clustering visual tokens and aligning them with phrase-level text at multiple granularities. Experimental results on multimodal embedding and downstream tasks show that HieRD consistently outperforms strong baselines, demonstrating the effectiveness of its fine-grained semantic and spatial modeling while yielding compact and efficient embedding models.