

Poster in Workshop: Text, camera, action! Frontiers in controllable video generation

Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation

Susung Hong · Junyoung Seo · Heeseong Shin · Sunghwan Hong · Seungryong Kim

Keywords: [ Zero-Shot Text-to-Video Generation ]


Abstract:

In the paradigm of AI-generated content (AIGC), there has been increasing attention to transferring knowledge from pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks face challenges in maintaining consistent narratives and handling shifts in scene composition or object placement when driven by a single abstract user prompt. Exploring the ability of large language models (LLMs) to generate time-dependent, frame-by-frame prompts, this paper introduces a new framework, dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors, enabling the inclusion of time-varying content and facilitating consistent video generation. To maintain temporal consistency and prevent attention values from being mapped to the wrong object across frames, we equip the diffusion model with a novel value mapping method and dual-softmax filtering.
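The abstract does not spell out the value mapping or filtering details, but dual-softmax matching is a standard way to score correspondences between two sets of features. As a minimal, illustrative sketch only (not the paper's implementation), the snippet below assumes per-frame attention key/value features and a hypothetical confidence threshold, and shows how confident cross-frame matches could be used to carry an anchor frame's values into the current frame; the function names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_softmax_filter(sim, threshold=0.2):
    """Score cross-frame matches with dual-softmax.

    sim: (N_cur, N_anchor) similarity matrix between key features of the
    current frame and an anchor frame. Confidence is the product of the
    column-wise and row-wise softmaxes; matches below the (hypothetical)
    threshold are discarded as ambiguous.
    """
    conf = F.softmax(sim, dim=0) * F.softmax(sim, dim=1)
    match_idx = conf.argmax(dim=1)                # best anchor token per current token
    keep = conf.max(dim=1).values > threshold     # reject low-confidence matches
    return match_idx, keep

def remap_values(values_anchor, values_cur, sim, threshold=0.2):
    """Illustrative value remapping: where a confident match exists, reuse the
    anchor frame's attention values so the matched token keeps a consistent
    appearance across frames."""
    match_idx, keep = dual_softmax_filter(sim, threshold)
    out = values_cur.clone()
    out[keep] = values_anchor[match_idx[keep]]
    return out

# Toy usage with random features (64 tokens, 32-dim keys/values).
keys_cur, keys_anchor = torch.randn(64, 32), torch.randn(64, 32)
values_cur, values_anchor = torch.randn(64, 32), torch.randn(64, 32)
sim = keys_cur @ keys_anchor.T / 32 ** 0.5
fused = remap_values(values_anchor, values_cur, sim)
```

This is only a sketch of the general dual-softmax filtering idea; the actual DirecT2V formulation of value mapping is described in the full paper.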
