Memory as Dynamics: Learning Reliability-Guided Predictive Models for Online Video Perception
Abstract
Predictive memory has recently emerged as a powerful mechanism for online video models, enabling temporal reasoning beyond static memory banks. However, we observe a paradoxical phenomenon in which predictive memory often exacerbates drift under occlusion or rapid motion: inaccurate predictions contaminate the internal state and lead to irreversible identity loss. We identify this failure as a reliability mismatch: generative predictive dynamics are applied uniformly, even when their uncertainty is high and observational evidence is weak. To address this issue, we reinterpret video memory as a dynamic latent process rather than a static buffer. Building on this insight, we introduce Reliability-guided Predictive Memory (RPM), a framework that explicitly regulates when and how predictive dynamics should influence online video perception. RPM integrates a state-space latent world model to generate predictive priors, while employing a reliability-aware fusion policy that adaptively suppresses unreliable predictions in challenging scenarios such as occlusion and re-acquisition. We instantiate RPM on a SAM2-based foundation video model and evaluate it on challenging visual object tracking benchmarks, a representative instance of online video perception. Experimental results demonstrate that our method significantly reduces drift after occlusion, consistently outperforming strong baselines that rely on either static memory or unconditional predictive modeling. These findings establish that predictive memory is beneficial only when its reliability is explicitly modeled, and suggest a general principle for robust online video perception.
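The abstract does not specify the form of the reliability-aware fusion policy. As a minimal sketch of the general idea, the following blends a predictive prior with the current observation by a scalar reliability weight, so that unreliable predictions are suppressed; the function name, the state representation, and the scalar gating rule are all hypothetical illustrations, not the paper's actual method.

```python
import numpy as np

def reliability_gated_fuse(predicted_state: np.ndarray,
                           observed_state: np.ndarray,
                           reliability: float) -> np.ndarray:
    """Blend a predictive prior with an observation via a reliability weight.

    `reliability` is a hypothetical scalar in [0, 1], e.g. derived from
    prediction uncertainty and observation confidence. When it is low
    (occlusion, rapid motion), the predictive prior is suppressed and the
    fused state falls back toward the observation.
    """
    w = float(np.clip(reliability, 0.0, 1.0))
    return w * predicted_state + (1.0 - w) * observed_state
```

For example, with `reliability=0.0` the fused state equals the observation, so a wildly wrong prediction cannot contaminate the memory; with `reliability=1.0` the prior dominates, which is useful when the target is fully occluded and no observation is trustworthy.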