OmniShow: Orchestrating Multimodal Conditions for Human-Object Interaction Video Generation
Donghao Zhou ⋅ Guisheng Liu ⋅ Hao Yang ⋅ Jiatong Li ⋅ Jingyu Lin ⋅ Xiaohu Huang ⋅ Yichen Liu ⋅ Xin Gao ⋅ Cunjian Chen ⋅ Shilei Wen ⋅ Chi Wing Fu ⋅ Pheng Ann Heng
Abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality HOI videos from text, reference-image, audio, and pose conditions. To address the challenges of harmonious multimodal condition injection and heterogeneous data utilization, we present OmniShow, an end-to-end framework tailored for HOIVG. We introduce Unified Channel-wise Conditioning to efficiently inject image and pose cues, Gated Local-Context Attention to ensure precise audio-visual synchronization, and a Decoupled-then-Joint Training strategy to effectively harness heterogeneous data. Extensive experiments on the proposed HOIVG-Bench demonstrate that OmniShow achieves state-of-the-art performance.