Poster in Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning

What’s left can’t be right - The remaining positional incompetence of contrastive vision-language models

Nils Hoehing · Ellen Rushe · Anthony Ventresque


Abstract:

Contrastive vision-language models like CLIP have been found to lack spatial understanding capabilities. In this paper we discuss the possible causes of this phenomenon by analysing both datasets and embedding space. Focusing on simple left-right positional relations, we show that this failure is entirely predictable, even with large-scale datasets; we demonstrate that these relations can be taught using synthetic data; and we show that this approach generalises well to natural images, improving performance on left-right relations in Visual Genome Relationships. The code for all our experiments and analysis can be found on GitHub.
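As a rough illustration of the kind of left-right probe the abstract describes, the sketch below scores an image against two captions that differ only in the positional word. This is an assumption of the evaluation setup, not the authors' released code (see their GitHub for that); the model checkpoint, prompts, and test image `photo.jpg` are all placeholders.

```python
# Minimal left-right probe for CLIP (illustrative sketch, not the paper's code).
# Assumes Hugging Face `transformers` and `Pillow` are installed, and that
# "photo.jpg" is any image containing two distinct objects side by side.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical test image

# Two captions that differ only in the positional word; a model with genuine
# left-right understanding should assign higher probability to the correct one.
captions = [
    "a dog to the left of a cat",
    "a dog to the right of a cat",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, 2)
probs = logits.softmax(dim=-1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
# If the two probabilities stay near 0.5 regardless of the image, the model is
# effectively ignoring the positional word, which is the failure mode at issue.
```

Mirroring the image and checking whether the preferred caption flips is one simple way to separate genuine positional understanding from caption-frequency bias.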
