video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM
Guangzhi Sun ⋅ Yixuan Li ⋅ Xiaodong Wu ⋅ Yudong Yang ⋅ Wei Li ⋅ Zejun Ma ⋅ Chao Zhang
Abstract
Long-duration streaming video understanding is fundamental for future AI agents, yet remains limited by ineffective long-term memory. We introduce video-SALMONN S, a memory-enhanced streaming audio-visual large language model that processes videos of over 3 hours at $1$ FPS and $360$p resolution, outperforming strong non-streaming models under the same memory budget. Beyond token merging and downsampling, video-SALMONN S is the first to employ test-time training (TTT) as a streaming memory mechanism for video understanding. TTT continuously transforms short-term multimodal representations into long-term memory embedded in the model parameters. To improve long-range dependency modeling and memory capacity, we propose (i) a TTT$_\text{MEM}$ layer with an additional long-span prediction objective, (ii) a two-stage training scheme, and (iii) a modality-aware memory reader. We further introduce the episodic learning from video memory (ELViM) benchmark, which simulates agent-like scenarios where models must learn from videos observed hours earlier. video-SALMONN S consistently outperforms both streaming and non-streaming baselines by $3$-$7\%$ on long video benchmarks. Notably, video-SALMONN S achieves a $15\%$ absolute accuracy improvement over strong non-streaming models on ELViM, demonstrating a strong ability to learn from video memory.
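The core idea of TTT as a streaming memory can be illustrated in miniature. The sketch below is an assumption-laden toy, not the paper's TTT$_\text{MEM}$ layer: a hypothetical `TTTLinearMemory` holds a single fast-weight matrix that is updated by online gradient steps on a next-embedding prediction loss, so the stream is compressed into parameters rather than an ever-growing token cache, and can later be read back with a frozen forward pass.

```python
import numpy as np

class TTTLinearMemory:
    """Toy linear fast-weight memory updated by test-time gradient steps.

    Illustrative only: the real TTT_MEM layer, its objective, and its
    reader are more elaborate; this shows the mechanism in isolation.
    """

    def __init__(self, dim: int, lr: float = 0.05, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((dim, dim))  # memory parameters
        self.lr = lr

    def write(self, x_t: np.ndarray, x_next: np.ndarray) -> float:
        """One inner-loop step: minimise 0.5 * ||W x_t - x_next||^2."""
        err = self.W @ x_t - x_next
        # gradient of the loss w.r.t. W is err x_t^T
        self.W -= self.lr * np.outer(err, x_t)
        return float(0.5 * err @ err)

    def read(self, query: np.ndarray) -> np.ndarray:
        """Retrieve from memory without updating it."""
        return self.W @ query

# Stream a toy sequence whose (hidden) dynamics are linear: the
# prediction loss should fall as the memory absorbs the stream.
dim = 8
rng = np.random.default_rng(1)
A = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # unknown dynamics
mem = TTTLinearMemory(dim)
losses = []
for _ in range(300):
    x_t = rng.standard_normal(dim)      # stand-in for a frame embedding
    losses.append(mem.write(x_t, A @ x_t))
```

After streaming, `mem.read(q)` approximates the hidden dynamics `A @ q` for unseen queries, which is the sense in which short-term observations have become long-term memory in the weights.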