Poster

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk · Lijun Yu · Xiuye Gu · Jose Lezama · Jonathan Huang · Grant Schindler · Rachel Hornung · Vighnesh N Birodkar · Jimmy Yan · Ming-Chang Chiu · Krishna Somandepalli · Hassan Akbari · Yair Alon · Yong Cheng · Joshua V Dillon · Agrim Gupta · Meera Hahn · Anja Hauth · David Hendon · Alonso Martinez · David Minnen · Mikhail Sirotenko · Kihyuk Sohn · Xuan Yang · Hartwig Adam · Ming-Hsuan Yang · Irfan Essa · Huisheng Wang · David Ross · Bryan Seybold · Lu Jiang

Hall C 4-9 #608
Best Paper
[ Project Page ] [ Paper PDF ]
Tue 23 Jul 2:30 a.m. PDT — 4 a.m. PDT
 
Oral presentation: Oral 1D Video
Tue 23 Jul 1:30 a.m. PDT — 2:30 a.m. PDT

Abstract:

We present VideoPoet, a language model capable of synthesizing high-quality video from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting the ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
