

Poster in Workshop: 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning

Refined and Enriched Physics-based Captions for Unseen Dynamic Changes

Hidetomo Sakaino

Keywords: [ VQA ] [ enrichment ] [ vision-language ] [ physical scale ] [ disaster ] [ weather ] [ caption ] [ dynamic ] [ adversarial ]


Abstract:

Vision-Language Models (VLMs) trained on image-text pairs, e.g., CLIP, have boosted image-based Deep Learning (DL). Unseen images can be handled by transferring semantic knowledge from seen classes with the help of language models pre-trained only on text. Two-dimensional spatial relationships and a higher semantic level have been achieved. Moreover, Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative texts, in captions. However, the capability of VLMs still falls far short of human perception. This paper proposes PanopticCAP, which refines and enriches qualitative and quantitative captions to bring them closer to what humans recognize by combining multiple DL models and VLMs. In particular, captions are integrated with physical scales and objects' surface properties through counting, visibility distance, and road conditions. Fine-tuned VLM models are also used, together with an iteratively refined caption model trained with a new physics-based contrastive loss function. Experimental results on images with adversarial weather conditions, i.e., rain, snow, fog, landslide, and flooding, and traffic events, i.e., accidents, outperform state-of-the-art DL models and VLMs. A higher semantic level in captions for real-world scene descriptions is demonstrated.
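The abstract does not state the physics-based contrastive loss in closed form. As a rough, hypothetical illustration only, such an objective could combine a standard CLIP-style symmetric InfoNCE term over matched image-caption pairs with a penalty on disagreements between physical quantities (e.g., object counts, visibility distance) estimated from the image and those parsed from the caption. The function name and the `temperature` and `lambda_phys` weights below are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F


def physics_contrastive_loss(image_emb, caption_emb,
                             pred_phys, caption_phys,
                             temperature=0.07, lambda_phys=1.0):
    """Hypothetical sketch: CLIP-style contrastive loss plus a
    physics-consistency penalty.

    image_emb, caption_emb : (B, D) embeddings from the vision / text encoders
    pred_phys, caption_phys: (B, K) physical quantities (e.g., count,
                             visibility distance) estimated from the image
                             and parsed from the caption, respectively.
    """
    # Symmetric InfoNCE over the batch: matched image-caption pairs are
    # positives, all other pairs in the batch are negatives.
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = image_emb @ caption_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_nce = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Penalize captions whose stated physical quantities disagree with the
    # quantities estimated from the image (relative L1 error).
    loss_phys = (pred_phys - caption_phys).abs().div(
        caption_phys.abs() + 1e-6).mean()

    return loss_nce + lambda_phys * loss_phys
```

In this sketch the physics term would push the iteratively refined captions toward quantitatively consistent statements (counts, distances, road conditions), while the contrastive term keeps captions semantically aligned with their images; how the actual paper balances the two is not specified in the abstract.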
