CAST: Modeling Visual State Transitions for Consistent Video Retrieval
Yanqing Liu ⋅ Yingcheng Liu ⋅ Fanghong Dong ⋅ Budianto Budianto ⋅ Cihang Xie ⋅ Yan Jiao
Abstract
As video content creation shifts towards long-form narratives, retrieving and composing short clips into coherent storylines becomes a critical challenge. Standard formulations, however, retrieve clips context-agnostically, prioritizing local semantic alignment while neglecting procedural state and identity consistency across time. To address this, we introduce the task of Consistent Video Retrieval (CVR) and establish a rigorous benchmark across YouCook2, COIN, and CrossTask, designed to explicitly evaluate temporal and identity consistency. We propose CAST (Context-Aware State Transition), a lightweight, embedding-agnostic adapter that models procedural progression by predicting a state-conditioned residual update ($\Delta$) from visual history, thereby decoupling dynamic procedural state from static identity. Extensive experiments demonstrate that CAST yields significant and consistent gains over standard baselines across diverse datasets. Furthermore, we showcase its potential as a plug-and-play consistency verifier, guiding black-box generation models (e.g., Sora, Veo) toward physically plausible continuations.
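The abstract's core mechanism — an adapter that predicts a residual update $\Delta$ from visual history and adds it to a static identity embedding — can be illustrated with a minimal sketch. All names (`cast_update`, the aggregation by mean, the single linear map `W`) are illustrative assumptions, not the paper's actual architecture, which would use learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative; real systems use much larger D)

# Hypothetical adapter weights; in CAST these would be learned parameters.
W = rng.normal(scale=0.1, size=(D, D))

def cast_update(identity_emb, history_embs):
    """Sketch of a state-conditioned residual update: summarize the visual
    history into a state vector, map it to a residual delta, and add the
    delta to the static identity embedding to form the retrieval query."""
    state = history_embs.mean(axis=0)  # aggregate procedural state from history
    delta = W @ state                  # predicted residual update (the paper's Δ)
    return identity_emb + delta        # history-aware query; identity preserved as base

identity = rng.normal(size=D)          # static identity embedding
history = rng.normal(size=(3, D))      # embeddings of previously retrieved clips
query = cast_update(identity, history)
```

Because the update is purely additive over an existing embedding, such an adapter can in principle sit on top of any frozen retrieval encoder, which is what "embedding-agnostic" suggests.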