ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Abstract
Large Multimodal Models (LMMs) exhibit notable shortfalls when interpreting images and, by some measures, display poorer spatial cognition than young children or animals. Despite this, they attain high scores on many popular visual benchmarks, and the remaining headroom is rapidly eroded by surging model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench—a lightweight visual reasoning benchmark curated using adversarial filtering to be “impossible” for frontier LMMs at release time, with initial SotA scores of 0% pass@1 and pass^5. We track progress on ZeroBench over the subsequent year, observing SotA reach 6% pass^5 and 19% pass@5, indicating the potential longevity of our benchmark. Overall, we evaluate 46 LMMs on ZeroBench, compare performance to a human baseline, analyse strengths and weaknesses, and chart performance over a year of advancement in visual capabilities.
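The abstract reports scores under three metrics: pass@1 (a single attempt is correct), pass@5 (at least one of 5 attempts is correct), and pass^5 (all 5 attempts are correct). As a rough illustration only—the paper's exact evaluation code is not reproduced here—these can be estimated per question from c correct answers out of n sampled attempts, using the standard unbiased pass@k estimator and a simple plug-in estimate for pass^k:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts is correct, given c correct out of n sampled attempts."""
    if n - c < k:
        return 1.0  # too few wrong attempts to fill k draws with failures
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """Plug-in pass^k estimate: probability that all k independent
    attempts are correct, approximated as (c/n)^k."""
    return (c / n) ** k

# Hypothetical example: 2 correct answers out of 5 attempts on one question
print(pass_at_k(5, 2, 5))   # -> 1.0 (k = n, and at least one attempt succeeded)
print(pass_pow_k(5, 2, 5))  # -> 0.01024, i.e. 0.4 ** 5
```

The ordering pass^5 ≤ pass@1 ≤ pass@5 explains why the tracked SotA pass^5 score (6%) sits below the pass@5 score (19%).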