Expo Talk Panel
VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
Lauren Araujo
AUDITORIUM
A standard approach to representing a video is a fixed spatiotemporal grid of tokens that mirrors the original 3D structure of the signal. Such tokenizers, however, produce a fixed-length token sequence regardless of the complexity of the input. In addition, the grid structure biases tokens toward capturing local information from the original signal. In this work, we develop a tokenizer that learns to represent an input video in a coarse-to-fine manner, where early tokens encode the most salient semantic features of the whole video, while later tokens incrementally refine the representation with more fine-grained details. Additionally, we introduce an autoregressive temporal loss over the learned tokens that serves two purposes: first, it makes the tokens more suitable for subsequent autoregressive video modeling; second, it encourages the learning of higher-level abstractions that are more predictable over time. We study the representations learned through this process and evaluate their usefulness for downstream applications such as video modeling.
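To make the flexible-length idea concrete, here is a minimal sketch of what a coarse-to-fine ordering allows: any prefix of the token sequence is itself a usable (coarser) representation. The abstract does not say how this ordering is learned; sampling a random prefix length during training, as done below, is one possible scheme and is purely an assumption here. The function names and token values are hypothetical and chosen only for illustration.

```python
import random


def sample_prefix_length(max_tokens, rng):
    """Pick how many leading tokens to keep.

    Assumption: a uniform draw over 1..max_tokens. The talk abstract
    does not describe the actual training schedule.
    """
    return rng.randint(1, max_tokens)


def keep_prefix(tokens, k):
    """Return the first k tokens of an ordered sequence.

    With a coarse-to-fine ordering, the dropped tail holds the
    finer-grained details, so a shorter prefix is a coarser summary.
    """
    if not 1 <= k <= len(tokens):
        raise ValueError("k must be between 1 and len(tokens)")
    return tokens[:k]


# Toy token ids for one video, ordered coarse to fine (made-up values).
video_tokens = [17, 4, 92, 33, 8, 61, 25, 70]

# A short prefix as a coarse description of the whole video.
coarse_only = keep_prefix(video_tokens, 2)

# A randomly sized prefix, as a training step might use under the
# assumption stated above.
rng = random.Random(0)
k = sample_prefix_length(len(video_tokens), rng)
training_prefix = keep_prefix(video_tokens, k)
```

In this sketch, a simple video could be represented with a short prefix and a complex one with a longer prefix, which is the sense in which the sequence length can depend on the input rather than on a fixed grid.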