video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM
Guangzhi Sun ⋅ Yixuan Li ⋅ Xiaodong Wu ⋅ Yudong Yang ⋅ Wei Li ⋅ Zejun Ma ⋅ Chao Zhang
Abstract
Long-duration streaming video understanding is fundamental for future AI agents, yet remains limited by ineffective long-term memory. We introduce video-SALMONN S, a memory-enhanced streaming audio-visual large language model that processes videos of over 3 hours at $1$ FPS and $360$p resolution, outperforming strong non-streaming models under the same memory budget. Beyond token merging and downsampling, video-SALMONN S is the first to employ test-time training (TTT) as a streaming memory mechanism for video understanding. TTT continuously transforms short-term multimodal representations into long-term memory embedded in the model parameters. To improve long-range dependency modeling and memory capacity, we propose (i) a TTT$_\text{MEM}$ layer with an additional long-span prediction objective, (ii) a two-stage training scheme, and (iii) a modality-aware memory reader. We further introduce the episodic learning from video memory (ELViM) benchmark, which simulates agent-like scenarios where models must learn from videos observed hours earlier. video-SALMONN S consistently outperforms both streaming and non-streaming baselines by $3$-$7\%$ on long video benchmarks. Notably, video-SALMONN S achieves a $15\%$ absolute accuracy improvement over strong non-streaming models on ELViM, demonstrating a strong ability to learn from video memory.
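The core idea of TTT as a streaming memory can be illustrated in miniature. The sketch below is an assumption-laden toy, not the paper's TTT$_\text{MEM}$ layer: a hypothetical `TTTLinearMemory` holds a single fast-weight matrix that is updated by online gradient steps on a next-embedding prediction loss, so the stream is compressed into parameters rather than an ever-growing token cache, and can later be read back with a frozen forward pass.

```python
import numpy as np

class TTTLinearMemory:
    """Toy linear fast-weight memory updated by test-time gradient steps.

    Illustrative only: the real TTT_MEM layer, its objective, and its
    reader are more elaborate; this shows the mechanism in isolation.
    """

    def __init__(self, dim: int, lr: float = 0.05, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((dim, dim))  # memory parameters
        self.lr = lr

    def write(self, x_t: np.ndarray, x_next: np.ndarray) -> float:
        """One inner-loop step: minimise 0.5 * ||W x_t - x_next||^2."""
        err = self.W @ x_t - x_next
        # gradient of the loss w.r.t. W is err x_t^T
        self.W -= self.lr * np.outer(err, x_t)
        return float(0.5 * err @ err)

    def read(self, query: np.ndarray) -> np.ndarray:
        """Retrieve from memory without updating it."""
        return self.W @ query

# Stream a toy sequence whose (hidden) dynamics are linear: the
# prediction loss should fall as the memory absorbs the stream.
dim = 8
rng = np.random.default_rng(1)
A = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # unknown dynamics
mem = TTTLinearMemory(dim)
losses = []
for _ in range(300):
    x_t = rng.standard_normal(dim)      # stand-in for a frame embedding
    losses.append(mem.write(x_t, A @ x_t))
```

After streaming, `mem.read(q)` approximates the hidden dynamics `A @ q` for unseen queries, which is the sense in which short-term observations have become long-term memory in the weights.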