Position: Video LLMs Must Not Ignore the Pixel Dynamics in Plain Sight
Abstract
The essence of video lies in pixel dynamics: motion, state transitions, and the flow of visual information across frames. Video Large Language Models (LLMs) have rapidly become the dominant paradigm for video understanding in computer vision, enabling sophisticated multimodal reasoning over complex, long-form visual streams. In this position paper, we argue that recent progress in video understanding is measured by benchmarks and protocols that can be solved without reliably perceiving spatiotemporal evidence, rewarding language-driven plausibility over video-grounded inference. We identify two coupled failure modes that consistently emerge across recent Video LLM evaluations: (i) static-cue dominance, where appearance and context outweigh spatiotemporal evidence, and (ii) prior-driven temporal hallucination, where learned regularities fill in temporal and causal structure when dynamics are subtle or counterintuitive. We synthesize recent diagnostic probes that expose these failure modes into a call to action for the community: to re-center video understanding on what a video uniquely contains, namely, dynamic evidence that unfolds over time, by enforcing spatiotemporal grounding in both models and benchmarks, before the pixel dynamics are lost in plain sight.