Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

Sethuraman T V ⋅ Savya Khosla ⋅ Aditi Tiwari ⋅ Vidya Ganesh ⋅ Rakshana Jayaprakash ⋅ Aditya Jain ⋅ Vignesh Srinivasakumar ⋅ Onkar Susladkar ⋅ Joey Wang ⋅ Srinidhi Sunkara ⋅ Aditya Shanmugham ⋅ Abbaas Alif Mohamed Nishar ⋅ Rakesh Vaideeswaran Mahesh ⋅ Simon Jenni ⋅ Derek Hoiem

Abstract

Video-Language Models (VidLMs) achieve strong benchmark scores, yet these scores often hide whether models use the video at all. We show that VidLM failures follow two pathways: some visual signals are never reliably encoded, while others are encoded but overridden by model priors. We introduce REVEAL, a diagnostic stress-test benchmark for quantifying when and why VidLMs under-use visual evidence. REVEAL contains five controlled probes: camera-motion sensitivity, cross-frame integration, video sycophancy, language-only shortcuts, and temporal expectation bias. Together, they test whether models encode basic video signals, combine evidence across frames, and preserve visual evidence against user assertions, language cues, and learned event expectations. Across 11 VidLMs, we find systematic failures along both pathways. Under assertive prompts, several models produce near-identical outputs for real videos and random noise, making visual evidence effectively causally inert. We further carry out mechanistic probes to identify where these failures arise in the model pipeline and why visual evidence is lost. REVEAL provides a scalable, human-verified framework for moving beyond aggregate scores toward structured, reproducible evaluation of multimodal reliability.
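
The real-video versus random-noise comparison can be illustrated with a small sketch. This is not the REVEAL implementation: the function names (`noise_like`, `noise_agreement_rate`), the exact-match comparison, the prompt text, and the two toy "models" are all assumptions made for illustration. The idea is to ask a model the same assertive prompt about a real video and about a noise video of the same shape, then report how often the two answers match. A rate near 1.0 would suggest the visual input has little effect on the output.

```python
import random
from typing import Callable, List

Frame = List[List[int]]   # one grayscale frame: rows of pixel values in 0..255
Video = List[Frame]       # a video is a list of frames
Model = Callable[[Video, str], str]  # hypothetical interface: (video, prompt) -> answer

# Hypothetical assertive prompt; the actual REVEAL prompts are not given in the abstract.
ASSERTIVE_PROMPT = "The scene is clearly bright. Is it bright or dark?"


def noise_like(video: Video, rng: random.Random) -> Video:
    """Return uniform random pixels with the same shape as `video`."""
    return [[[rng.randrange(256) for _ in row] for row in frame] for frame in video]


def noise_agreement_rate(model: Model, videos: List[Video], prompt: str, seed: int = 0) -> float:
    """Fraction of videos whose answer is unchanged when the video is replaced by noise."""
    if not videos:
        raise ValueError("need at least one video")
    rng = random.Random(seed)
    same = 0
    for video in videos:
        real_answer = model(video, prompt).strip().lower()
        noise_answer = model(noise_like(video, rng), prompt).strip().lower()
        same += real_answer == noise_answer  # exact match is an assumed, simple criterion
    return same / len(videos)


def prior_only_model(video: Video, prompt: str) -> str:
    """Toy stand-in that ignores the video and agrees with the prompt."""
    return "bright"


def brightness_model(video: Video, prompt: str) -> str:
    """Toy stand-in that actually reads the pixels."""
    pixels = [p for frame in video for row in frame for p in row]
    return "bright" if sum(pixels) / len(pixels) > 200 else "dark"


# Three toy "real" videos: 2 frames of 4x4 all-white pixels each.
toy_videos: List[Video] = [[[[255] * 4 for _ in range(4)] for _ in range(2)] for _ in range(3)]

print("prior-only:", noise_agreement_rate(prior_only_model, toy_videos, ASSERTIVE_PROMPT))
print("pixel-reading:", noise_agreement_rate(brightness_model, toy_videos, ASSERTIVE_PROMPT))
```

The prior-only stand-in answers identically for real and noise inputs, while the pixel-reading stand-in changes its answer once the video is replaced. For free-form generations, a softer similarity measure than exact string match would likely be needed.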