Workshop
Text, camera, action! Frontiers in controllable video generation
Michal Geyer · Joanna Materzynska · Jack Parker-Holder · Yuge Shi · Trevor Darrell · Nando de Freitas · Antonio Torralba
Hall A8
Sat 27 Jul, midnight PDT
The past few years have seen the rapid development of Generative AI, with powerful foundation models demonstrating the ability to generate new, creative content in multiple modalities. Following breakthroughs in text and image generation, it is clear that the next frontier lies in video. One challenging yet compelling aspect unique to video generation is the variety of ways in which such generation can be controlled: from specifying the content of a video with text, to viewing a scene from different camera angles, to directing the actions of characters within the video. We have also seen the use cases of these models diversify, with works that extend generation to 3D scenes, use such models to learn policies for robotics tasks, or create interactive environments for gameplay. Given the great variety of algorithmic approaches, the rapid progress, and the tremendous potential for applications, we believe now is the perfect time to engage the broader machine learning community in this exciting new research area. We thus propose the first workshop on Controllable Video Generation (CVG), focused on algorithms that can control video generation across multiple modalities and frequencies, and on the wide range of potential applications. We anticipate that CVG will be uniquely relevant to ICML, as it brings together a variety of different communities: from traditional computer vision, to safety and alignment, to those working on world models in reinforcement learning or robotics settings. This makes ICML the perfect venue, where seemingly unrelated communities can join together and share ideas in this emerging area of AI research.