NAVIGATE: Evaluating Visual-Guided Search Decision-Making on the Open Web
Abstract
Vision–Language Models (VLMs) are increasingly deployed with web search tools, yet we still lack benchmarks that isolate a capability critical for real-world use: deciding when to search and how to steer search from ambiguous visual evidence, especially when multiple images provide overlapping or conflicting cues. We introduce NAVIGATE, a benchmark that treats images as the primary evidence for open-web search planning and multi-step reasoning. It contains 500 questions across 20 domains and spans three difficulty tiers, ranging from single-image, self-contained problems to multi-image joint search and multi-domain composition. Unlike prior benchmarks that specify explicit search targets, NAVIGATE evaluates search decision-making: models must infer whether external search is necessary and iteratively refine search directions through holistic reasoning over visual cues. Across a broad set of VLMs and search-enabled systems, performance remains low; Gemini-3-Pro-Preview-Search reaches only 36.4% accuracy, highlighting persistent failures in cross-image grounding, search triggering, and search-strategy coordination. We will release NAVIGATE publicly.