Oral

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk ⋅ Lijun Yu ⋅ Xiuye Gu ⋅ Jose Lezama ⋅ Jonathan Huang ⋅ Grant Schindler ⋅ Rachel Hornung ⋅ Vighnesh N Birodkar ⋅ Jimmy Yan ⋅ Ming-Chang Chiu ⋅ Krishna Somandepalli ⋅ Hassan Akbari ⋅ Yair Alon ⋅ Yong Cheng ⋅ Joshua V Dillon ⋅ Agrim Gupta ⋅ Meera Hahn ⋅ Anja Hauth ⋅ David Hendon ⋅ Alonso Martinez ⋅ David Minnen ⋅ Mikhail Sirotenko ⋅ Kihyuk Sohn ⋅ Xuan Yang ⋅ Hartwig Adam ⋅ Ming-Hsuan Yang ⋅ Irfan Essa ⋅ Huisheng Wang ⋅ David Ross ⋅ Bryan Seybold ⋅ Lu Jiang

Best Paper

2024 Oral

Abstract

We present VideoPoet, a language model capable of synthesizing high-quality video from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting the ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
