Poster in the DMLR Workshop: Data-centric Machine Learning Research
PhysicsCAP: Natural Scene Understanding By Semantic Segmentation, CLIP And Physical Models Through Refined and Enriched Captions
Hidetomo Sakaino
Vision-Language Models (VLMs), e.g., CLIP trained on image-text pairs, have boosted image-based Deep Learning (DL). Unseen images can be handled by transferring semantic knowledge from seen classes with the help of language models pre-trained on text alone. Two-dimensional spatial relationships and higher-level semantics have been captured. Moreover, Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative text, in captions. However, the capability of VLMs still falls far short of human perception. Captions from state-of-the-art (SOTA) VLMs do not contain physical scales derived from images. Prepositions in captions, e.g., "left" and "on", convey the relative positions of objects. However, in addition to such two-dimensional cues, three-dimensional cues such as "far" may be more helpful. Therefore, physical scales are needed for better natural scene understanding. For example, visibility affects traffic flow and control on city roads, highways, and runways. Visibility distance or level is an important measure for predicting risk on the road. However, only a few papers have tackled such nighttime vision with visibility estimation. This paper proposes PhysicsCAP, which refines and enriches qualitative and quantitative captions to bring them closer to what humans recognize by combining multiple DL models and VLMs. In particular, captions are augmented with physical scales and object surface properties, i.e., water level, object counts, depth maps, visibility distance, and road conditions. Fine-tuned VLMs are also used, along with an iterative caption-refinement model trained with a new physics-based contrastive loss function. Experimental results on images with adverse weather conditions, i.e., rain, snow, fog, landslides, and flooding, and traffic events, i.e., accidents, show that PhysicsCAP outperforms SOTA DL models and VLMs. A higher semantic level in captions for real-world scene description is demonstrated.
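The abstract does not specify the form of the physics-based contrastive loss. Below is a minimal, hypothetical sketch of how a CLIP-style symmetric contrastive objective could be combined with a physics-consistency term over quantitative scales such as visibility distance or water level; the function name, the `phys_pred`/`phys_target` tensors, and the `lambda_phys` weight are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch: CLIP-style contrastive loss plus a physics-consistency
# penalty. Not the paper's implementation; all names below are assumptions.
import torch
import torch.nn.functional as F


def physics_contrastive_loss(img_emb, txt_emb, phys_pred, phys_target,
                             temperature=0.07, lambda_phys=0.1):
    """Symmetric InfoNCE over image/caption embeddings, plus an L1 penalty
    tying predicted physical scales (e.g., visibility distance, water level)
    to reference values."""
    # Normalize embeddings and compute scaled pairwise similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature

    # Matched image-caption pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)

    # Physics term: penalize captions whose quantitative estimates
    # deviate from reference measurements.
    loss_phys = F.l1_loss(phys_pred, phys_target)

    return 0.5 * (loss_i2t + loss_t2i) + lambda_phys * loss_phys


# Toy usage with random tensors (batch of 8, 512-d embeddings, 3 physical scales).
if __name__ == "__main__":
    B, D = 8, 512
    loss = physics_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                                    torch.randn(B, 3), torch.randn(B, 3))
    print(loss.item())
```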