DynaMem: Consistent Long Video Generation via Hierarchical Memory and Motion Priors
Abstract
Recent text-to-video diffusion models can synthesize visually compelling clips from natural language prompts. However, practical applications increasingly demand long-form videos with evolving narratives and persistent identity. A common solution is autoregressive generation, in which a long video is synthesized clip by clip, yet coherence often degrades as errors compound over long horizons. In this work, we study long-video generation in this autoregressive setting: despite strong short-clip quality, existing approaches often suffer from semantic drift, motion decay, and appearance instability as the sequence grows. We present DynaMem, a unified framework that improves long-horizon coherence through three components: Semantic-Adaptive Hierarchical Memory for long-range semantic preservation, Dynamics-Prioritized Optimization for motion-coherent learning, and Reference-Anchored Perceptual Alignment for appearance stabilization. Extensive experiments show that DynaMem produces more consistent semantics, stronger temporal dynamics, and more stable appearance on long videos than competitive baselines.