Poster in Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning
What’s left can’t be right - The remaining positional incompetence of contrastive vision-language models
Nils Hoehing · Ellen Rushe · Anthony Ventresque
Abstract:
Contrastive vision-language models like CLIP have been found to lack spatial understanding capabilities. In this paper we discuss the possible causes of this phenomenon by analysing both datasets and embedding space. By focusing on simple left-right positional relations, we show that this behaviour is entirely predictable, even with large-scale datasets, demonstrate that these relations can be taught using synthetic data, and show that this approach can generalise well to natural images, improving the performance on left-right relations on Visual Genome Relationships. The code for all our experiments and analysis can be found on GitHub.
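As a rough illustration of the kind of left-right probe the abstract describes, one can score an image against two captions that differ only in the positional word and check which one CLIP prefers. This is a minimal sketch, not the authors' evaluation code; it assumes the Hugging Face transformers CLIP checkpoint "openai/clip-vit-base-patch32", and the image file and caption pair are hypothetical placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical test image in which a cat appears to the left of a dog.
image = Image.open("cat_left_of_dog.jpg")
captions = [
    "a cat to the left of a dog",   # assumed correct for this image
    "a cat to the right of a dog",  # same caption with the relation swapped
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; a softmax over
# the two captions gives the model's preference between "left" and "right".
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for caption, p in zip(captions, probs):
    print(f"{p:.3f}  {caption}")

If the model's scores are near chance regardless of where the cat actually is, that is the positional incompetence the paper examines; the same comparison run over pairs from Visual Genome Relationships would give the kind of left-right benchmark the abstract reports improvements on.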