Refined and Enriched Physics-based Captions for Unseen Dynamic Changes
Abstract
Vision-Language Models (VLMs), e.g., CLIP trained on image-text pairs, have boosted image-based Deep Learning (DL). Unseen images can be handled by transferring semantic knowledge from seen classes with the help of language models pre-trained only on text. Reasoning about two-dimensional spatial relationships and higher semantic levels has also been achieved. Moreover, Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative texts, in captions. However, the capability of VLMs still falls far short of human perception. This paper proposes PanopticCAP, which refines and enriches qualitative and quantitative captions to bring them closer to what humans recognize by combining multiple DLs and VLMs. In particular, captions are integrated with physical scales and objects' surface properties, such as object counts, visibility distance, and road conditions. Fine-tuned VLMs are also used. An iteratively refined caption model with a new physics-based contrastive loss function is introduced. Experimental results on images with adverse weather and hazard conditions, i.e., rain, snow, fog, landslides, and flooding, and traffic events, i.e., accidents, show that the proposed method outperforms state-of-the-art DLs and VLMs. The resulting captions demonstrate a higher semantic level for real-world scene descriptions.