Video-BCI: Bayesian Cognitive Integration of Self-Prior Hypotheses for Video Understanding
Abstract
Recent progress in vision-language models (VLMs) has driven significant advances in video understanding. However, existing methods often act as naive empiricists, mapping video input directly to output without any mechanism to introspect on or challenge their inherent biases. In this work, we challenge this paradigm by reframing video reasoning as a Bayesian cognitive process. We propose Video-BCI (Bayesian Cognitive Integration of Self-Prior Hypotheses), a novel framework that first samples a set of Self-Prior Hypotheses representing the model's intuitive yet potentially biased cognitive state, and then guides the VLM to critically integrate these priors. This process encourages the model to challenge an erroneous majority consensus in cases of high information divergence and to distill superior reasoning chains from its own prior space. The integration is driven by a composite Cognitive Utility Function comprising two intrinsic learning signals: a Dialectical Uncertainty Signal (DUS) and a Process Tracing Signal (PTS). The DUS incentivizes correct, non-majority judgments by quantifying both the conflict (entropy) among priors and their consensus-challenging strength. The PTS guides the model to trace and learn from reasoning paths within its own priors that lead to better answers, enabling self-driven procedural knowledge distillation. Extensive experiments on six mainstream benchmarks show that Video-BCI achieves new state-of-the-art (SOTA) results across the board; for example, it surpasses the previous best result on the MMVU benchmark by 3.8%. Our code will be made publicly available.
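To make the DUS concrete, the sketch below illustrates one plausible instantiation of its two components: a conflict term (Shannon entropy over the answers of the sampled Self-Prior Hypotheses) and a consensus-challenging term (a bonus when the final answer is correct yet disagrees with the majority of the priors). The function name, the additive combination, and the unit bonus are our illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter
import math

def dialectical_uncertainty_signal(prior_answers, final_answer, correct_answer):
    """Illustrative DUS sketch (assumed form, not the paper's exact formula):
    entropy over prior answers + a bonus for correct, non-majority judgments."""
    counts = Counter(prior_answers)
    n = len(prior_answers)
    # Conflict term: Shannon entropy of the empirical answer distribution
    # over the sampled Self-Prior Hypotheses.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Consensus-challenging term: reward a final answer that is correct
    # yet disagrees with the majority vote of the priors.
    majority_answer, _ = counts.most_common(1)[0]
    is_correct = final_answer == correct_answer
    challenges_majority = final_answer != majority_answer
    bonus = 1.0 if (is_correct and challenges_majority) else 0.0
    return entropy + bonus
```

Under this assumed form, unanimous priors with a majority-agreeing answer yield zero signal, while a correct answer that overturns a split, mistaken consensus is rewarded most.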