Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
CONTEXTUAL: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Rohan Wadhawan · Hritik Bansal · Kai-Wei Chang · Nanyun Peng
Many real-world tasks require joint reasoning over the text and visual elements in an image (e.g., navigating in public spaces), which we refer to as context-sensitive text-rich visual reasoning. Specifically, these tasks require an understanding of the context in which the text interacts with the visual elements within an image. Due to the lack of existing datasets for this task, we introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning over text-rich images. We conduct experiments to assess the performance of 14 foundation models (e.g., GPT-4V, Gemini-Pro-Vision, LLaVA-Next) and establish a human performance baseline. Further, we perform a human evaluation of the model responses and observe a significant performance gap of 30.8% between the best-performing LMM, GPT-4V, and the human baseline. Our fine-grained analysis reveals that GPT-4V encounters difficulties interpreting time-related data and infographics, but demonstrates proficiency in comprehending abstract visual contexts such as memes and quotes. Finally, our qualitative analysis uncovers various factors contributing to poor performance, including imprecise visual perception and hallucinations. Our dataset, code, and leaderboard can be found on the project page: https://con-textual.github.io/.