MoVie: Multimodal Video Compression with Text Guidance
Jiaqi Hu ⋅ Haoji Hu ⋅ Heming Sun ⋅ Lianrui Mu
Abstract
Most deep video codecs emphasize low-level motion modeling and remain largely semantics-agnostic, which can degrade perceptual quality in complex scenes. We propose **MoVie**, a **M**ultim**o**dal **Vi**d**e**o compression framework built on a Text-guided Video Transformer–CNN Mixed block (*Text-VideoTCM*). MoVie adopts a video-centric architecture that jointly models local spatial structures and temporal dynamics via window-based processing, delivering a favorable computation--perception trade-off. To incorporate semantics, we introduce dual-stage text fusion with *Extractor* and *Injector* modules. We further present history-conditioned coding that leverages both previous and aggregated historical frames, and a spatial--channel factorized entropy model that estimates probabilities over spatial neighborhoods and channel groups for adaptive bit allocation. Together, these designs reduce redundancy and improve rate control and temporal coherence, yielding perceptually faithful reconstructions at low bitrates. On UVG and MCL-JCV, MoVie achieves **$-$50.23\%** BD-rate for FID and **$-$14.64\%** for LPIPS (VGGNet) relative to HM, while requiring only **55.76\%** of DCVC-FM's per-pixel kMACs.
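To make the spatial--channel factorization concrete, the following is a minimal NumPy sketch of the idea, not the paper's learned model: a quantized latent is split into channel groups decoded in order, and each symbol's Gaussian parameters are conditioned on the spatial neighborhood of already-decoded groups. The group count, the 3×3 neighborhood, and the mean-based context predictor are illustrative assumptions standing in for learned networks.

```python
import numpy as np
from math import erf, sqrt, log2

rng = np.random.default_rng(0)

C, H, W, G = 8, 6, 6, 2                  # channels, spatial size, channel groups (toy sizes)
y = rng.normal(size=(C, H, W))           # stand-in for the codec's latent tensor
y_hat = np.round(y)                      # unit-width scalar quantization
groups = np.split(np.arange(C), G)       # channel groups, decoded sequentially

def bin_prob(x, mu, sigma):
    """Mass of the unit quantization bin [x-0.5, x+0.5] under N(mu, sigma^2)."""
    cdf = lambda t: 0.5 * (1.0 + erf((t - mu) / (sigma * sqrt(2.0))))
    return max(cdf(x + 0.5) - cdf(x - 0.5), 1e-12)

def local_mean(plane, i, j):
    """Mean over a 3x3 spatial neighborhood (stand-in for a learned context model)."""
    patch = plane[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
    return patch.mean()

total_bits = 0.0
decoded = []                             # channel groups decoded so far
for chans in groups:
    for c in chans:
        for i in range(H):
            for j in range(W):
                if decoded:
                    # Later groups: condition the mean on spatial neighborhoods
                    # of earlier channel groups (spatial--channel factorization).
                    ctx = [y_hat[d] for d in np.concatenate(decoded)]
                    mu = float(np.mean([local_mean(p, i, j) for p in ctx]))
                else:
                    mu = 0.0             # first group: context-free factorized prior
                total_bits += -log2(bin_prob(y_hat[c, i, j], mu, 1.0))
    decoded.append(chans)

print(f"estimated total bits: {total_bits:.1f}")
```

Because each group's probabilities adapt to context from earlier groups, well-predicted regions receive tighter distributions and hence fewer bits, which is the adaptive bit allocation the abstract refers to.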