ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Abstract
Large Multimodal Models (LMMs) exhibit notable shortfalls when interpreting images and, by some measures, display poorer spatial cognition than young children or animals. Despite this, they attain high scores on many popular visual benchmarks, and the remaining headroom is rapidly eroded by surging model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench—a lightweight visual reasoning benchmark curated using adversarial filtering to be “impossible” for frontier LMMs at release time, with initial SotA scores of 0% pass@1 and pass^5. We track progress on ZeroBench over the subsequent year, observing SotA reach 6% pass^5 and 19% pass@5, indicating the potential longevity of our benchmark. Overall, we evaluate 46 LMMs on ZeroBench, compare performance to a human baseline, analyse strengths and weaknesses, and chart performance over a year of advancement in visual capabilities.
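The abstract reports scores under three metrics: pass@1 (a single attempt is correct), pass@5 (at least one of 5 attempts is correct), and pass^5 (all 5 attempts are correct). As a rough illustration only—the paper's exact evaluation code is not reproduced here—these can be estimated per question from c correct answers out of n sampled attempts, using the standard unbiased pass@k estimator and a simple plug-in estimate for pass^k:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts is correct, given c correct out of n sampled attempts."""
    if n - c < k:
        return 1.0  # too few wrong attempts to fill k draws with failures
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """Plug-in pass^k estimate: probability that all k independent
    attempts are correct, approximated as (c/n)^k."""
    return (c / n) ** k

# Hypothetical example: 2 correct answers out of 5 attempts on one question
print(pass_at_k(5, 2, 5))   # -> 1.0 (k = n, and at least one attempt succeeded)
print(pass_pow_k(5, 2, 5))  # -> 0.01024, i.e. 0.4 ** 5
```

The ordering pass^5 ≤ pass@1 ≤ pass@5 explains why the tracked SotA pass^5 score (6%) sits below the pass@5 score (19%).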