Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
Abstract
Video-Language Models (VidLMs) achieve strong benchmark scores, yet these scores often hide whether models use the video at all. We show that VidLM failures follow two pathways: some visual signals are never reliably encoded, while others are encoded but overridden by model priors. We introduce REVEAL, a diagnostic stress-test benchmark for quantifying when and why VidLMs under-use visual evidence. REVEAL contains five controlled probes: camera-motion sensitivity, cross-frame integration, video sycophancy, language-only shortcuts, and temporal expectation bias. Together, they test whether models encode basic video signals, combine evidence across frames, and preserve visual evidence against user assertions, language cues, and learned event expectations. Across 11 VidLMs, we find systematic failures along both pathways. Under assertive prompts, several models produce near-identical outputs for real videos and random noise, making visual evidence effectively causally inert. We further carry out mechanistic probes to identify where these failures arise in the model pipeline and why visual evidence is lost. REVEAL provides a scalable, human-verified framework for moving beyond aggregate scores toward structured, reproducible evaluation of multimodal reliability.
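As a rough illustration of the real-video versus random-noise comparison described above, the following minimal sketch computes how often a model gives the same answer whether it sees the real video or noise. It is not REVEAL's actual metric or code: the function names `normalize` and `noise_agreement_rate`, the exact-match criterion, and the example answers are all hypothetical. Answers are assumed to have been collected already as plain strings, one per prompt, in the same order for both conditions.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def noise_agreement_rate(real_answers: Sequence[str], noise_answers: Sequence[str]) -> float:
    """Fraction of prompts whose answer is unchanged when the video is replaced by noise.

    Hypothetical diagnostic, not the benchmark's own metric. A value near 1.0
    suggests the visual input has little influence on the model's output.
    """
    if len(real_answers) != len(noise_answers):
        raise ValueError("Both conditions must cover the same prompts.")
    if not real_answers:
        raise ValueError("At least one prompt is required.")
    matches = sum(
        normalize(real) == normalize(noise)
        for real, noise in zip(real_answers, noise_answers)
    )
    return matches / len(real_answers)


# Illustrative, made-up answers for three prompts.
real = ["Yes, the camera pans left.", "No", "yes"]
noise = ["yes, the camera pans left.", "No", "no"]
rate = noise_agreement_rate(real, noise)
```

Here two of the three answers match after normalization, so `rate` is 2/3. Exact string matching is a deliberately simple choice; free-form outputs would likely need a softer similarity measure.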