Object-level Semantic and Spatial Distillation for Open Vocabulary Detection
Abstract
Recent Open-Vocabulary Object Detection (OVD) approaches adapt CLIP through region-level distillation to improve semantic alignment for novel categories. However, the distilled regional features are often used for both classification and localization, enhancing semantic consistency at the expense of spatial fidelity. To resolve this, we propose Object-level Semantic and Spatial Distillation (OSSD), a two-stage framework that explicitly decouples semantic and spatial feature learning. OSSD first distills object-level semantics from CLIP's global [CLS] embeddings to enhance region discrimination, and then injects fine-grained spatial and structural priors via spatial distillation from a detector trained only on COCO base categories. Furthermore, we propose a Location Quality Estimation Head (LQEH) that predicts class-agnostic localization quality, complementing objectness confidence to improve novel-object perception. Extensive experiments show that our method achieves 49.2 AP50 on the OV-COCO benchmark, exceeding the best previous result by 3.6\%. On the OV-LVIS benchmark, our method reaches 40.5 mAP on novel categories, outperforming previous state-of-the-art methods.