Oral in Workshop: 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning

Refined and Enriched Physics-based Captions For Unseen Dynamic Changes

Keywords: [ adversarial ] [ dynamic ] [ caption ] [ weather ] [ disaster ] [ physical scale ] [ vision-language ] [ enrichment ] [ VQA ]


Abstract:

Vision-Language Models (VLMs) such as CLIP, trained on image-text pairs, have boosted image-based Deep Learning (DL). Unseen images can be handled by transferring semantic knowledge from seen classes with the help of language models pre-trained only on text. Two-dimensional spatial relationships and higher-level semantics have been captured in this way. Moreover, Visual-Question-Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative texts, in captions. However, the capability of VLMs still falls far short of human perception. This paper proposes PanopticCAP, which refines and enriches qualitative and quantitative captions to bring them closer to what humans recognize by combining multiple DL models and VLMs. In particular, captions are integrated with physical scales and objects' surface properties through object counting, visibility distance, and road conditions. Fine-tuned VLM models are also used, together with an iteratively refined caption model trained with a new physics-based contrastive loss function. Experimental results on images with adversarial weather conditions, i.e., rain, snow, fog, landslides, and flooding, and traffic events, i.e., accidents, outperform state-of-the-art DLs and VLMs. A higher semantic level in captions for real-world scene descriptions is demonstrated.
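The abstract does not define the physics-based contrastive loss, so the following is only an illustrative sketch of one way such an objective could look: a CLIP-style InfoNCE loss whose targets are softened by agreement between quantitative scene attributes (object counts, visibility distance, road condition). The function name, the attribute encoding, and the exponential weighting scheme are all assumptions for illustration, not the authors' actual formulation.

```python
# Hypothetical sketch of a physics-based contrastive loss (PyTorch >= 1.10).
# All names and the weighting scheme are illustrative assumptions.
import torch
import torch.nn.functional as F

def physics_contrastive_loss(img_emb, txt_emb, phys_attrs,
                             temperature=0.07, alpha=0.5):
    """img_emb, txt_emb: (B, D) L2-normalised embeddings from a VLM.
    phys_attrs: (B, K) per-image physical quantities (e.g., object count,
    visibility distance in metres), min-max normalised to [0, 1].
    """
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities

    # Physical agreement: pairs whose quantitative attributes are close get
    # non-zero target mass, so captions with wrong counts or distances are
    # pushed away harder than physically similar mismatches.
    dist = torch.cdist(phys_attrs, phys_attrs, p=1)       # (B, B) L1 distances
    phys_sim = torch.exp(-dist)                           # 1.0 on the diagonal

    # Mix the standard one-hot targets with physics-based soft targets,
    # then row-normalise into a probability distribution per image.
    eye = torch.eye(len(img_emb), device=logits.device)
    targets = F.normalize(alpha * eye + (1 - alpha) * phys_sim, p=1, dim=1)

    # Symmetric soft-label cross-entropy (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

With alpha = 1 this reduces to the usual symmetric CLIP loss; lowering alpha increases how much quantitative physical agreement shapes the training signal.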
